
Public Single-cell RNA-Seq Databases: A Comprehensive Overview

Explore an extensive, curated list of public scRNA-seq databases, their key features, advantages, limitations, and metadata standards.

Comprehensive Guide to Public Single-Cell RNA-Seq Databases and Resources

Single-cell RNA sequencing (scRNA-seq) has yielded vast datasets covering diverse organisms, tissues, and disease states. Numerous public databases aggregate and curate these datasets to facilitate reuse by researchers. Below we compile an extensive list of credible, active scRNA-seq databases (global and regional), and discuss their features, advantages, limitations, metadata standards, accessibility, and scientific relevance. This report is organized into structured sections for clarity.

1. Catalog of Public scRNA-Seq Databases

The table below summarizes key public databases and portals for single-cell transcriptomic data. Each entry gives the database name, a brief description (including headline figures such as cell or dataset counts where available), and its scope. All listed resources are backed by peer-reviewed publications or major consortia, and most remain actively maintained.

| Database Name | Description | Scope |
| --- | --- | --- |
| Arc Virtual Cell Atlas: scBaseCamp | AI-curated repository integrating >230 million single-cell profiles across multiple species and tissues, with standardized metadata. | General, multi-species |
| Arc Virtual Cell Atlas: Tahoe-100M | Massive perturbation atlas containing 100 million transcriptomic profiles from ~60,000 drug perturbation experiments across 50 cancer cell lines. | Perturbation-focused |
| Cancer Single-cell Expression Map (CancerSCEM) | Integrates and visualizes scRNA-seq data from human cancers, providing extensive multidimensional analyses, including metabolic profiling. | Cancer-focused |
| Cell Types Database: RNA-Seq Data (Allen Institute) | Offers extensive single-cell and single-nucleus transcriptomic data, linked to morphological and electrophysiological characterizations of neurons from human and mouse brain regions. | Tissue-specific (brain) |
| CellxGene Census | Provides API-based access to CellxGene single-cell datasets in AnnData and TileDB formats, hosted on AWS cloud infrastructure, so data can be queried without downloading large files locally. | General |
| DISCO | Aggregates over 100 million cells from publicly available single-cell datasets, harmonized to facilitate consistent analysis across studies. | General |
| Human Cell Atlas (HCA) | A global effort to build comprehensive reference maps of all human cells, facilitating insights into human health, disease, and development. | General (human) |
| Human Protein Atlas (HPA) | Integrates scRNA-seq data from 31 human tissues with protein-level immunohistochemical staining, linking transcriptomic and proteomic profiles. | General (human) |
| PanglaoDB | Database of mouse and human scRNA-seq experiments with pre-annotated cell-type markers, facilitating gene and cell-type exploration. | General |
| Perturbation Atlas (Perturb-seq) | Resource containing single-cell RNA-seq data systematically capturing cellular responses to genetic and chemical perturbations for functional genomic insights. | Perturbation-focused |
| scRNASeqDB | Curated database of 36 human single-cell gene expression datasets from GEO, involving 8,910 cells categorized into 174 cell groups. | General |
| Single Cell Expression Atlas (SCEA) | Cross-species repository providing uniformly processed single-cell RNA-seq data to facilitate cross-study comparisons and gene-expression searches. | General, multi-species |
| Single Cell Portal (Broad Institute) | Broad Institute's portal hosting single-cell datasets including contributions from consortia like Human Cell Atlas, with user-friendly visual exploration tools. | General |
| Tumor Immune Single-cell Hub (TISCH2) | Dedicated to tumor microenvironment analysis, providing detailed single-cell annotations of immune and stromal populations across cancer datasets. | Cancer-focused |
| UCSC Genome Browser: Single Cell RNA-Seq | Single-cell RNA expression datasets from various human tissues (e.g., kidney, colon, heart, muscle, placenta, peripheral blood mononuclear cells), accessible via UCSC Genome Browser tracks. | General (human tissues) |

2. Advantages and Limitations of These Databases

Advantages

Data Reuse and Discovery:

  • Researchers can query these databases to check whether a cell type or condition of interest has already been profiled, avoiding redundant experiments. For example, a gene search in PanglaoDB, SCEA, or the Arc Virtual Cell Atlas shows that gene's expression across dozens of tissues and studies, enabling hypothesis generation.
  • Cross-tissue portals (HCA, SCEA, and Arc Virtual Cell Atlas) allow comparing gene expression in different organs to find common or unique cell-type markers. This is valuable for broad explorations (e.g., identifying a gene’s expression pattern across the body or surveying all cell types present in a disease).

Large Sample Sizes:

  • By integrating many studies, these databases provide enormous aggregate cell counts (often millions of cells), boosting statistical power for detecting rare cell populations or subtle gene expression changes.
  • The Human Lung Cell Atlas pooled 49 datasets (~2.4 million cells) to create a consensus lung cell reference, enabling discovery of rare cell types or states that single studies might miss and supporting meta-analyses (finding robust patterns consistent across studies).
  • The Arc Virtual Cell Atlas integrates over 300 million single-cell profiles across species and tissues, expanding comparative research potential and enabling large-scale cross-species analysis. This makes it one of the largest single-cell databases available.
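
To make the statistical-power argument concrete, the following is a minimal sketch rather than a rigorous power analysis. It assumes (this is not stated above) that cells are sampled independently, so the count of a rare cell type follows a binomial distribution. The rare-type frequency of 1 in 100,000 and the 10,000-cell "single study" are illustrative values; the 2.4 million figure is the Human Lung Cell Atlas size quoted above.

```python
import math


def expected_rare_cells(n_cells: int, frequency: float) -> float:
    """Expected number of cells of a type that occurs at the given frequency."""
    return n_cells * frequency


def prob_at_least_one(n_cells: int, frequency: float) -> float:
    """P(at least one rare cell is captured), assuming independent sampling.

    Computes 1 - (1 - f)^n in a numerically stable way.
    """
    if not 0.0 <= frequency < 1.0:
        raise ValueError("frequency must be in [0, 1)")
    return -math.expm1(n_cells * math.log1p(-frequency))


# Illustrative assumption: a cell type present at 1 in 100,000 cells.
RARE_FREQUENCY = 1e-5

for label, n in [("single study", 10_000), ("integrated atlas", 2_400_000)]:
    expected = expected_rare_cells(n, RARE_FREQUENCY)
    prob = prob_at_least_one(n, RARE_FREQUENCY)
    print(f"{label}: expected rare cells = {expected:.1f}, P(>=1) = {prob:.3f}")
```

Under these assumptions the single study expects 0.1 such cells and has roughly a 10% chance of capturing even one, while the integrated atlas expects about 24. Real datasets violate the independence assumption (tissue sampling and dissociation bias cell-type frequencies), so treat the numbers as an intuition aid only.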

Domain-Specific Insights:

  • Tissue-Specific Atlases: The Allen Brain Cell Types Database helps neuroscientists explore neuronal subtypes with rich context (including morphology and electrophysiology data). The UCSC Genome Browser provides access to a variety of tissue-specific single-cell datasets.
  • Disease-Focused Databases: TISCH2, CancerSCEM, and Arc Virtual Cell Atlas include large-scale datasets on tumor biology, immune infiltration, and drug responses. Cancer-focused repositories allow researchers to examine tumor-infiltrating immune cells or tumor heterogeneity across multiple cohorts, accelerating discovery of tumor biomarkers and immunotherapy targets.
  • Perturbational Data: The Perturbation Atlas (Perturb-seq) compiles scRNA-seq datasets from genetic and chemical perturbations, enabling systematic studies of gene function and of the molecular mechanisms underlying cellular responses. Similarly, the Arc Virtual Cell Atlas hosts the Tahoe-100M dataset, which comprises transcriptomic profiles from 100 million cells across approximately 60,000 perturbation experiments, making it one of the most extensive resources for pharmacogenomic and drug-response studies.

Ease of Access and Analysis:

  • Many portals, such as cellxgene, Single Cell Portal, and CancerSCEM, provide user-friendly web interfaces with interactive visualizations (UMAP plots, heatmaps, cell-type clustering tools), so biologists without bioinformatics or programming expertise can explore the data directly.
  • The Arc Virtual Cell Atlas offers bulk data downloads with structured metadata and an open GitHub repository for programmatic access. This makes it highly scalable for computational research and machine-learning applications.

Limitations

Data Processing Heterogeneity:

  • Different databases apply different processing pipelines and quality controls. General-purpose repositories like GEO/SRA or portals such as SCP often host author-submitted data without uniform reprocessing, meaning batch effects and platform differences remain. This variability can complicate downstream integration and comparison across datasets.
  • Platforms like Single Cell Expression Atlas (SCEA) and the Arc Virtual Cell Atlas (scBaseCamp) uniformly reprocess data, applying standardized methods and quality controls across all datasets. Uniform reprocessing significantly simplifies the integration of datasets within the same database, reducing technical biases and facilitating more robust cross-study analyses. Notably, scBaseCamp focuses on datasets generated by the widely used 10x Genomics platform, further enhancing consistency and interoperability among its extensive cell collections. However, each database adopts its own uniform pipeline, so subtle inconsistencies can still arise when integrating data across multiple resources.
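
At its simplest, "uniform reprocessing" means applying one identical transformation to every dataset. The sketch below shows a common convention, library-size normalization followed by a log transform, on toy count vectors. It is an illustration only: it is not the pipeline used by SCEA or scBaseCamp (those start from raw reads and include alignment and quality control), and the function names and the target sum of 10,000 are assumptions made for this example.

```python
import math


def normalize_cell(counts, target_sum=10_000.0):
    """Scale one cell's counts to a common total, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all-zero instead of dividing by zero.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]


def normalize_dataset(cells, target_sum=10_000.0):
    """Apply the same normalization to every cell in a dataset."""
    return [normalize_cell(cell, target_sum) for cell in cells]


# Two toy "studies" with the same relative expression but a 100x
# difference in sequencing depth (rows are cells, columns are genes).
shallow_study = [[1, 3, 0]]
deep_study = [[100, 300, 0]]

print(normalize_dataset(shallow_study))
print(normalize_dataset(deep_study))
```

After normalization the two toy cells become directly comparable, which is the effect uniform pipelines aim for within a database. Depth normalization alone does not remove batch effects caused by protocol or platform differences.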

Metadata and Annotation Variability:

  • The usefulness of a single-cell database depends on the quality of its metadata (e.g., correct labeling of cell types, tissues, disease status).
  • Automated metadata extraction (e.g., Arc Virtual Cell Atlas' AI curation pipeline) improves consistency but may still require manual validation.
  • The HCA Data Portal enforces metadata standardization, while other databases rely on community-supplied annotations, which may introduce variability.

Scope vs. Depth Tradeoff:

  • Cross-tissue databases (HCA, Arc Virtual Cell Atlas, DISCO) offer breadth, which enables broad comparisons, but may lack the depth of specialized resources.
  • Specialized databases (TISCH2 for tumors, UCSC Genome Browser for tissue-specific scRNA-seq) provide highly detailed views but have limited applicability beyond their specific research domains.

Data Volume and Computational Complexity:

  • Large datasets (HCA: 58M cells, DISCO: 100M+ cells, Arc Virtual Cell Atlas: 300M+ cells) require high-performance computing resources for full-scale analysis.
  • Cloud-based access (e.g., HCA, Arc Virtual Cell Atlas) helps mitigate local computing limitations but still requires bioinformatics expertise for effective data retrieval and processing.
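
A back-of-envelope estimate shows why datasets at this scale exceed ordinary workstations. The sketch below compares a dense matrix with a compressed sparse row (CSR) layout. The cell count uses the 100M figure quoted above; the 20,000 genes, 5% non-zero density, and 4-byte values are assumptions chosen for illustration, not properties of any listed database.

```python
def dense_bytes(n_cells: int, n_genes: int, value_bytes: int = 4) -> int:
    """Memory for a dense cells x genes matrix."""
    return n_cells * n_genes * value_bytes


def csr_bytes(
    n_cells: int,
    n_genes: int,
    density: float,
    value_bytes: int = 4,
    index_bytes: int = 4,
    indptr_bytes: int = 8,
) -> int:
    """Memory for a CSR sparse matrix: values + column indices + row pointers."""
    nnz = round(n_cells * n_genes * density)
    return nnz * (value_bytes + index_bytes) + (n_cells + 1) * indptr_bytes


N_CELLS = 100_000_000   # from the DISCO figure above
N_GENES = 20_000        # assumption
DENSITY = 0.05          # assumption: 5% of entries are non-zero

TB = 10**12
print(f"dense: {dense_bytes(N_CELLS, N_GENES) / TB:.1f} TB")
print(f"CSR:   {csr_bytes(N_CELLS, N_GENES, DENSITY) / TB:.1f} TB")
```

With these assumptions the dense matrix needs about 8 TB and the sparse one about 0.8 TB, which is why full-scale analysis typically relies on sparse formats, out-of-core access, or querying subsets in the cloud.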

Currency and Maintenance:

  • Some databases (e.g., PanglaoDB, last updated ~2020) may not include recent datasets, highlighting the broader challenge of maintaining databases in rapidly evolving fields such as single-cell genomics. Continuous maintenance requires substantial manual curation efforts, making sustained updating challenging over time.
  • Ongoing projects (Arc Virtual Cell Atlas, HCA, DISCO) are actively updated with new studies and annotations, ensuring researchers have access to the latest single-cell data. Notably, platforms like the Arc Virtual Cell Atlas use agentic AI, an automated and scalable curation approach that continuously discovers, extracts, and integrates new datasets. This AI-driven strategy greatly reduces manual effort, which supports long-term sustainability and helps databases remain current as new studies emerge.

3. Metadata Standards and Data Accessibility

Metadata Standards

Minimum Information Guidelines:

  • Standards like minSCe (Minimum Information about a Single-Cell Experiment) require metadata fields covering sample origin, sequencing platform, QC metrics, and experimental design.
  • The HCA Data Portal and Single Cell Expression Atlas enforce structured metadata submission to ensure data consistency.
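
A structured-submission check can be sketched as a simple required-field validation. The four categories come from the description of minSCe above, but the field names, the record layout, and the example values are hypothetical and do not reproduce the official minSCe schema or any portal's submission format.

```python
# Hypothetical field names, one per metadata category named above.
REQUIRED_FIELDS = (
    "sample_origin",
    "sequencing_platform",
    "qc_metrics",
    "experimental_design",
)


def _is_blank(value) -> bool:
    """Treat None and empty strings/containers as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or blank, in schema order."""
    return [name for name in REQUIRED_FIELDS if _is_blank(record.get(name))]


# Made-up submission with one blank and one absent field.
submission = {
    "sample_origin": "human lung biopsy",
    "sequencing_platform": "10x Genomics",
    "qc_metrics": {},
}

print(missing_fields(submission))
```

A portal that enforces structured metadata would reject or flag a submission for which this list is non-empty, which is what keeps records comparable across studies.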

Use of Ontologies:

  • The Arc Virtual Cell Atlas applies AI-driven metadata extraction from SRA, improving annotation consistency across its 300M+ cells.
  • The Single Cell Expression Atlas maps metadata to EMBL-EBI’s Experimental Factor Ontology (EFO), enhancing cross-study integration.
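
The benefit of ontology mapping is that differently worded labels resolve to one shared term. The sketch below illustrates the idea with a tiny synonym table. The synonym list is invented for this example, and the `EX:` identifiers are placeholders, not real EFO accessions; a production pipeline would look terms up in the ontology itself.

```python
# Free-text label (normalized) -> canonical term. Illustrative only.
SYNONYMS = {
    "lung": "lung",
    "lungs": "lung",
    "pulmonary tissue": "lung",
    "pbmc": "peripheral blood mononuclear cell",
    "pbmcs": "peripheral blood mononuclear cell",
    "peripheral blood mononuclear cell": "peripheral blood mononuclear cell",
}

# Canonical term -> placeholder identifier (NOT a real ontology accession).
TERM_IDS = {
    "lung": "EX:0000001",
    "peripheral blood mononuclear cell": "EX:0000002",
}


def harmonize(label: str):
    """Map a free-text label to (canonical term, term ID), or None if unknown."""
    key = " ".join(label.lower().split())  # lowercase, collapse whitespace
    canonical = SYNONYMS.get(key)
    if canonical is None:
        return None
    return canonical, TERM_IDS[canonical]


for raw in ["Lungs", "pulmonary  tissue", "PBMC", "mystery organ"]:
    print(repr(raw), "->", harmonize(raw))
```

Returning `None` for unknown labels, rather than guessing, leaves unmapped entries visible for manual review.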

File Formats:

  • The Arc Virtual Cell Atlas and HCA Portal provide cloud-stored datasets in AnnData format, which Scanpy reads natively and which can be converted for use with Seurat and Bioconductor.
  • Additionally, CellxGene provides datasets in AnnData and TileDB formats, enabling researchers to directly access and analyze data through APIs without needing to download large files locally.

Data Accessibility and Integration with Bioinformatics Tools

Download and API Access:

  • The Arc Virtual Cell Atlas provides data through Google Cloud Storage and GitHub, enabling API-based access optimized for large-scale and reproducible analyses.
  • Similarly, the CellxGene Census utilizes AWS cloud infrastructure, enabling efficient data retrieval and scalable, cloud-based analytics.
  • More generally, hosting on major cloud platforms (AWS, Google Cloud) improves data accessibility, letting researchers run computationally demanding analyses without extensive local computing resources. Compared with university-based or local hosting, this approach streamlines data distribution, enhances reproducibility, and supports long-term database maintenance.
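
As an illustration of API-based access, the sketch below queries the CellxGene Census with the third-party `cellxgene_census` Python package (installed separately, for example with `pip install cellxgene-census`). The helper names `build_obs_filter` and `fetch_cells` are made up for this example, and the filter columns `tissue_general` and `is_primary_data` are taken from the Census cell metadata schema as understood at the time of writing. Check the package documentation for the current API before relying on it.

```python
def build_obs_filter(tissue_general: str, primary_only: bool = True) -> str:
    """Build a cell-level filter expression for a Census query."""
    clauses = [f"tissue_general == '{tissue_general}'"]
    if primary_only:
        # Exclude cells that are duplicated across datasets.
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


def fetch_cells(tissue_general: str, organism: str = "Homo sapiens"):
    """Download matching cells as an AnnData object (requires network access).

    The import is deferred so that the filter helper above can be used
    without the package installed.
    """
    import cellxgene_census  # third-party dependency

    with cellxgene_census.open_soma() as census:
        return cellxgene_census.get_anndata(
            census=census,
            organism=organism,
            obs_value_filter=build_obs_filter(tissue_general),
        )


# Safe to run offline: only builds the query string.
print(build_obs_filter("lung"))
```

A whole-tissue query can return millions of cells, so in practice the filter is usually narrowed further (for example by cell type or disease) before calling `fetch_cells`.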

How Can Public scRNA-seq Databases Be Integrated into Bioinformatics Workflows for Robust Biomedical Insights?

Public scRNA-seq databases provide a rich, diverse, and well-curated resource for researchers. While metadata inconsistencies and computational demands remain challenges, continuous improvements in metadata standardization, data curation, and accessibility, particularly through large-scale platforms like the Arc Virtual Cell Atlas and Human Cell Atlas, are significantly enhancing usability.

However, these databases typically do not offer comprehensive analytics solutions, leaving researchers to handle bioinformatics workflows independently. Companies such as Nygen Analytics address critical gaps by providing structured metadata tracking, version control, and reproducibility for single-cell data analysis. Platforms like LaminDB offer bioinformatics solutions, enabling seamless analytics for single-cell datasets by simplifying metadata tracking and analysis pipelines.

Thus, while public databases form the foundational resource for single-cell research, analytics solutions from companies such as Nygen remain essential for efficiently extracting actionable biological insights from complex single-cell datasets.
