
We believe in creating community-wide standards for the evaluation and curation of biological data. Our approach to standardizing genomics quality controls for WGS data follows the same philosophy, and together these frameworks cover the two most widely used sequencing modalities. In keeping with that belief, we are open-sourcing the quality control metrics for single-cell RNA sequencing (scRNA-seq) data that make up our Unobio scoring algorithm. This post explains which metrics we chose, why we chose them, and how we believe they can help researchers evaluate single-cell sequencing data quickly.
These quality control metrics and the associated processing are free of charge to all researchers using the Unobio platform. Simply upload your single-cell RNA-sequencing data and you will receive processed, quality-controlled datasets.
We will follow up by open-sourcing our code and then the scoring algorithm itself. We want the metrics presented here to have time for community feedback before we release further artifacts.
The metrics with their sources are as follows:
- Housekeeping Stability Index - Eisenberg & Levanon, Trends in Genetics (2013) (PMC6748759)
- Mitochondrial RNA Percentage Counts - HBC Training, scRNA-seq QC Tutorial
- Ribosomal RNA Percentage Counts - HBC Training, scRNA-seq QC Tutorial
- Hemoglobin Percentage - Borrelli et al., Frontiers in Molecular Biosciences (2022) (PMC9716519)
- Shannon Entropy - Grun et al., Nature Communications (2017) (ncomms15599)
- Doublet Detection - Jonathan Shor, DoubletDetection
- Signal to Noise Ratio - Mayer et al., Nucleic Acids Research (2021) (e83)
- Gini Coefficient - Jiang et al., PNAS (2018) (10.1073/pnas.1721085115)
- Gini Apoptosis/Stress/Enzyme/Heat Shock Indices - Derived from Jiang et al., PNAS (2018), applied to pathway-specific gene sets
These readouts form the basis of the data preprocessing and curation underlying the Unobio scores, which provide a way to filter and rank single-cell sequencing datasets.
Understanding Our Quality Metrics
1. Housekeeping Stability Index
What it measures: The stability of expression across a curated set of housekeeping genes, which should be expressed at relatively constant levels across cell types.
Why we use it: Housekeeping genes are expected to show stable expression regardless of cell type or condition. Significant variability in these genes indicates fundamental technical issues with RNA capture, library preparation, or sequencing.
Implications for interpretation: High variability in housekeeping gene expression suggests technical problems that likely affect all gene expression measurements, not just the housekeeping genes themselves. Cells with unstable housekeeping expression provide unreliable data for downstream analyses such as differential expression or trajectory inference. This metric serves as a foundational quality indicator; if housekeeping genes show irregular expression, even specialized gene signatures may be compromised.
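The exact formula behind the Housekeeping Stability Index is not given in this post. As a rough illustration of the idea only, the sketch below assumes that stability is summarized by the coefficient of variation (CV) of each housekeeping gene across cells, averaged over the gene set. The function names and the two example genes (ACTB, GAPDH) are illustrative choices, not the Unobio gene list or implementation.

```python
import statistics


def housekeeping_cvs(expr_by_gene):
    """Coefficient of variation per gene.

    expr_by_gene maps a gene name to a list of normalized expression
    values, one value per cell. Genes with zero mean are skipped because
    their CV is undefined.
    """
    cvs = {}
    for gene, values in expr_by_gene.items():
        mean = statistics.fmean(values)
        if mean == 0:
            continue
        # Population standard deviation divided by the mean.
        cvs[gene] = statistics.pstdev(values) / mean
    return cvs


def mean_housekeeping_cv(expr_by_gene):
    """Average CV over the gene set; lower values mean more stable expression."""
    cvs = housekeeping_cvs(expr_by_gene)
    if not cvs:
        return None
    return statistics.fmean(list(cvs.values()))


# Hypothetical toy data: ACTB is perfectly stable, GAPDH fluctuates.
toy = {
    "ACTB": [10, 10, 10, 10],
    "GAPDH": [5, 15, 5, 15],
}
summary = mean_housekeeping_cv(toy)
```

Under this assumed definition, a dataset whose housekeeping genes fluctuate strongly between cells gets a larger summary value, which matches the interpretation above that high variability points to technical problems.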
2. Mitochondrial RNA Percentage Counts
What it measures: The percentage of total transcripts mapping to mitochondrial genes.
Why we use it: Elevated mitochondrial content is a well-established indicator of cellular stress or death. When cells begin to lyse, cytoplasmic RNA is lost while mitochondrial RNA (which is protected by the mitochondrial membrane) remains intact, resulting in artificially high mitochondrial percentages.
Implications for interpretation: Cells with high mitochondrial percentages often represent dying or damaged cells whose transcriptional profiles no longer reflect their normal physiological state. Including these cells can introduce stress-response signatures into what should be healthy cell populations. However, some cell types naturally express higher levels of mitochondrial genes (e.g., cardiomyocytes, hepatocytes), so thresholds should be cell-type aware. Datasets with generally high mitochondrial percentages may indicate poor sample handling or processing.
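A minimal sketch of the percentage calculation, assuming human gene symbols where mitochondrially encoded genes carry the "MT-" prefix (mouse data typically uses "mt-"). The helper name percent_counts and the toy cell are hypothetical, and no threshold is implied.

```python
def percent_counts(counts, prefixes):
    """Percentage of a cell's total counts that come from genes whose
    symbol starts with any of the given prefixes.

    counts   -- dict mapping gene symbol to raw count for one cell
    prefixes -- tuple of upper-case prefixes, e.g. ("MT-",)
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    matched = sum(
        count
        for gene, count in counts.items()
        if gene.upper().startswith(prefixes)
    )
    return 100.0 * matched / total


# Hypothetical cell with 200 total counts.
cell = {"MT-CO1": 30, "MT-ND1": 20, "ACTB": 100, "RPS6": 50}
mito_pct = percent_counts(cell, ("MT-",))
```

The same helper can be pointed at ribosomal protein genes by passing the prefixes ("RPS", "RPL"). Because some cell types naturally carry more mitochondrial transcripts, any cutoff applied to the result should be chosen per cell type rather than hard-coded.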
3. Ribosomal RNA Percentage Counts
What it measures: The percentage of transcripts mapping to ribosomal protein genes.
Why we use it: Abnormal ribosomal percentages can indicate stress responses or technical biases in RNA capture. While ribosomal transcripts are generally abundant in healthy cells, extreme values (either high or low) often signal quality issues.
Implications for interpretation: Unusually high ribosomal percentages may indicate stress responses that shift cellular resources toward protein synthesis machinery. Conversely, unusually low percentages might indicate selective loss of highly expressed transcripts during library preparation. Both scenarios potentially distort the cell's true transcriptional state. Ribosomal percentage should be evaluated in conjunction with other metrics for comprehensive quality assessment.
4. Hemoglobin Percentage
What it measures: The proportion of total transcripts mapping to hemoglobin genes (typically HBA and HBB gene families).
Why we use it: Elevated hemoglobin gene expression often indicates red blood cell contamination in non-blood samples or incomplete red blood cell lysis in blood samples.
Implications for interpretation: High hemoglobin percentages can skew normalization and downstream analyses, particularly in samples where red blood cells should be absent or rare. This contamination can artificially introduce blood-associated signatures into other cell types, potentially leading to misinterpretation of results. This metric is especially important in tissue samples with vascularization, where blood cells may be inadvertently captured along with the tissue of interest.
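One practical detail when computing this metric: matching on a bare "HB" prefix also picks up unrelated genes such as HBEGF, so an explicit gene list is safer. The sketch below assumes human gene symbols. The list shown is an illustrative set of human globin genes, not necessarily the set Unobio uses.

```python
# Illustrative human hemoglobin gene symbols (an assumption, not the
# published Unobio gene set).
HEMOGLOBIN_GENES = frozenset(
    {"HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ"}
)


def hemoglobin_percent(counts, gene_set=HEMOGLOBIN_GENES):
    """Percentage of a cell's total counts assigned to hemoglobin genes.

    counts -- dict mapping gene symbol to raw count for one cell
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hb = sum(count for gene, count in counts.items() if gene in gene_set)
    return 100.0 * hb / total


# Hypothetical cell: HBEGF starts with "HB" but is not a globin gene,
# so it must not contribute to the hemoglobin percentage.
cell = {"HBB": 30, "HBA1": 10, "HBEGF": 20, "ACTB": 140}
hb_pct = hemoglobin_percent(cell)
```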
5. Shannon Entropy
What it measures: Shannon entropy quantifies the diversity and randomness in gene expression patterns within a cell.
Why we use it: Entropy serves as an indicator of transcriptional complexity. In high-quality cells, gene expression follows predictable patterns with moderate entropy values. Extremely low entropy suggests that too few genes are expressed (potentially indicating a low-quality or dying cell), while unusually high entropy often indicates technical noise.
Implications for interpretation: Cells with entropy values outside the expected range for their cell type should be scrutinized carefully. Low entropy cells might represent specialized cells with limited gene expression programs, but more often indicate poor-quality libraries. High entropy cells might represent technical artifacts rather than true biological diversity, potentially leading to spurious cell type identification. These kinds of hidden quality variations are examples of metadata leakage in science - subtle experimental differences that propagate and compound when datasets are compared across studies. Entropy provides a different perspective on expression distribution from metrics like the Gini coefficient, with both offering complementary insights.
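For reference, the Shannon entropy of a cell can be written as H = -sum(p_i * log2(p_i)), where p_i is the fraction of the cell's counts assigned to gene i. The sketch below assumes raw counts and a base-2 logarithm. The post does not state which normalization or log base the Unobio pipeline uses, so treat both as assumptions.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of one cell's expression profile.

    counts -- iterable of non-negative per-gene counts for a single cell.
    Genes with zero counts contribute nothing to the sum.
    """
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log2(p)
    return entropy


# Counts spread evenly over four genes give the maximum for four genes.
even_cell = shannon_entropy([5, 5, 5, 5])
# All counts on a single gene give the minimum.
single_gene_cell = shannon_entropy([10, 0, 0, 0])
```

The maximum possible value grows with the number of detected genes (log2 of the gene count), which is one reason entropy is best compared among cells of the same type and similar sequencing depth.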
6. Doublet Detection
What it measures: Whether a captured "cell" is likely to be a doublet, that is, two or more cells inadvertently captured and sequenced as a single cell. Doublets are identified computationally.
Why we use it: Doublets are unavoidable in most scRNA-seq protocols, with rates typically increasing with cell loading density. They create artificial "hybrid" transcriptional profiles that can be misinterpreted as intermediate cell states or novel cell types.
Implications for interpretation: Undetected doublets can severely confound lineage inference, pseudotime analysis, and cell type identification. They create apparent transitional states or false cell types that don't exist in vivo. By implementing robust doublet detection, researchers can maintain the fidelity of single-cell resolution and avoid erroneous biological conclusions. Various computational approaches exist, with different sensitivity and specificity profiles, so careful implementation is essential.
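Many computational doublet detectors share one core idea: simulate artificial doublets by adding the profiles of randomly paired cells, then flag real cells that look like the simulated ones. The toy sketch below illustrates that idea with a nearest-neighbour score on raw counts. It is not the DoubletDetection algorithm or the Unobio implementation, and the function names, parameters, and data are hypothetical.

```python
import math
import random


def simulate_doublets(cells, n_sim, rng):
    """Create synthetic doublets by summing the counts of two random cells."""
    sims = []
    for _ in range(n_sim):
        a, b = rng.sample(cells, 2)
        sims.append([x + y for x, y in zip(a, b)])
    return sims


def doublet_scores(cells, n_sim=50, k=3, seed=0):
    """Score each real cell by the fraction of its k nearest neighbours
    (Euclidean distance) that are synthetic doublets."""
    rng = random.Random(seed)
    sims = simulate_doublets(cells, n_sim, rng)
    # Tag every profile: False for real cells, True for synthetic doublets.
    labelled = [(c, False) for c in cells] + [(s, True) for s in sims]
    scores = []
    for i, cell in enumerate(cells):
        neighbours = sorted(
            (math.dist(cell, other), is_sim)
            for j, (other, is_sim) in enumerate(labelled)
            if j != i  # a cell is not its own neighbour
        )[:k]
        scores.append(sum(is_sim for _, is_sim in neighbours) / k)
    return scores


# Toy data with two genes: four cells of type A, four of type B,
# and one cell that expresses both programs (a likely doublet).
cells = [
    [10, 0], [11, 0], [9, 1], [10, 1],   # type A
    [0, 10], [1, 10], [0, 11], [1, 9],   # type B
    [10, 10],                            # suspicious hybrid profile
]
scores = doublet_scores(cells)
```

In this toy example the hybrid cell sits next to simulated A+B doublets, while every singlet has real cells of its own type as nearest neighbours. Production tools add normalization, dimensionality reduction, and statistical calibration on top of this basic scheme.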
7. Signal to Noise Ratio
What it measures: The ratio of biological signal to technical noise in scRNA-seq data.
Why we use it: A higher signal-to-noise ratio (SNR) indicates more reliable data where biological variation dominates over technical variation. This metric helps ensure that downstream analyses capture meaningful biological differences rather than technical artifacts.
Implications for interpretation: Datasets or cells with low SNR may lead to spurious identification of differentially expressed genes or cellular states that reflect technical variation rather than biology. Conversely, high SNR datasets provide greater confidence in subtle biological signals, enabling the detection of rare cell types and transitional states. SNR can be calculated at different levels (gene, cell, cluster) to evaluate quality from multiple perspectives.
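The post does not specify how SNR is computed. As one possible gene-level reading, the sketch below uses the classical two-group signal-to-noise statistic, (mean_a - mean_b) / (sd_a + sd_b), where the two groups might be two clusters or conditions. The choice of this statistic, the function name, and the example values are all assumptions made for illustration.

```python
import math
import statistics


def signal_to_noise(group_a, group_b):
    """Two-group signal-to-noise statistic for one gene.

    group_a, group_b -- expression values of the gene in two groups of
    cells; each group needs at least two values so that a sample
    standard deviation can be computed.
    """
    mean_a = statistics.fmean(group_a)
    mean_b = statistics.fmean(group_b)
    noise = statistics.stdev(group_a) + statistics.stdev(group_b)
    if noise == 0:
        # No within-group variation: the ratio is unbounded unless the
        # means are identical.
        if mean_a == mean_b:
            return 0.0
        return math.copysign(math.inf, mean_a - mean_b)
    return (mean_a - mean_b) / noise


# Hypothetical gene that is clearly higher in cluster A than in cluster B.
snr = signal_to_noise([10, 12, 14], [1, 2, 3])
```

A large absolute value means the difference between groups is big relative to the spread within each group. Values near zero mean the apparent difference is within the noise.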
8. Gini Coefficient
What it measures: The inequality of overall gene expression distribution within a cell.
Why we use it: A high Gini coefficient often indicates poor-quality cells or technical artifacts where a few genes account for most of the transcripts. By measuring expression inequality, we can identify cells that deviate from expected biological patterns.
Implications for interpretation: Cells with unusually high Gini coefficients typically show dominant expression of a few genes, which may not reflect their true physiological state. Including these cells can introduce artificial heterogeneity in downstream analyses. The Gini coefficient provides a statistical framework for evaluating expression distribution that complements other metrics and serves as the foundation for the specialized Gini indices described below.
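A minimal sketch of the Gini coefficient for a single cell, using the standard formula on sorted values: G = sum((2i - n - 1) * x_i) / (n * sum(x)), with i running from 1 to n over the values in ascending order. It assumes non-negative per-gene counts. Whether Unobio applies it to raw or normalized values is not stated in this post.

```python
def gini(values):
    """Gini coefficient of a list of non-negative values.

    Returns 0.0 for perfectly equal values and approaches 1.0 as a single
    value accounts for nearly all of the total.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Counts spread evenly across genes: no inequality.
balanced = gini([5, 5, 5, 5])
# One gene holds every count: the maximum for four genes, (n - 1) / n.
dominated = gini([0, 0, 0, 10])
```

Because the upper bound is (n - 1) / n, values are most comparable between cells evaluated over the same gene list.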
9. Gini Apoptosis Index
What it measures: Expression inequality among apoptosis-related genes.
Why we use it: Cells undergoing programmed cell death show characteristic gene expression patterns where specific apoptotic markers become highly upregulated.
Implications for interpretation: A high Gini coefficient for apoptosis genes indicates that a few apoptotic markers dominate the expression profile, suggesting the cell was dying during sample preparation. These dying cells often introduce technical noise and biological artifacts that can confound analyses of normal cellular states, particularly in processes like differentiation that can superficially resemble apoptosis. Identifying and removing these cells prevents contamination of biological signals with death-related signatures.
10. Gini Cell Stress Index
What it measures: Expression inequality among cell stress response genes.
Why we use it: Cells experiencing stress from dissociation protocols, FACS sorting, or extended processing times upregulate specific stress response pathways.
Implications for interpretation: A high stress index suggests that the cell was experiencing significant stress during collection or preparation, potentially altering its transcriptional profile away from its natural state in the tissue. This can lead to misinterpretation of cellular states and artificial clustering of cells based on shared stress responses rather than biological identity. By identifying stressed cells, researchers can account for technical stress signatures in their analyses or remove severely affected cells.
11. Gini Enzyme Induced Index
What it measures: Expression inequality among genes known to be artificially induced during enzymatic dissociation of tissues.
Why we use it: Enzymatic tissue dissociation methods (using collagenase, trypsin, etc.) can trigger specific gene expression responses that don't reflect the original in vivo state.
Implications for interpretation: Cells with high enzyme-induced index values were strongly affected by the dissociation process, potentially introducing technical artifacts that mask their true biological state. This metric is particularly important when comparing datasets generated using different dissociation protocols. Understanding dissociation-induced signatures allows researchers to distinguish technical variation from true biological differences, especially in cross-dataset integration.
12. Gini Heat Shock Index
What it measures: Expression inequality among heat shock protein genes and related thermal stress response genes.
Why we use it: Heat shock protein genes are rapidly induced by temperature fluctuations during sample handling, which makes their expression a sensitive readout of thermal stress.
Implications for interpretation: Elevated and uneven expression of heat shock genes (high Gini coefficient) indicates temperature stress during sample processing. These temperature-stressed cells may exhibit transcriptional changes that misrepresent their original biological state, potentially leading to spurious cell type identification or incorrect inference of cellular processes. This metric helps identify batch effects related to sample handling temperatures, which can be particularly important in multi-center studies or when comparing samples processed at different times.
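The four pathway-specific indices (apoptosis, cell stress, enzyme induced, heat shock) apply the same inequality measure to a restricted gene set. The sketch below assumes the index is simply the Gini coefficient over the counts of the genes in the set, with absent genes counted as zero. The heat shock gene names are a small illustrative list, not the curated Unobio gene sets.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = unequal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * total)


def gene_set_gini(cell_counts, gene_set):
    """Gini coefficient restricted to one pathway's genes.

    cell_counts -- dict mapping gene symbol to count for one cell
    gene_set    -- iterable of gene symbols; genes absent from the cell
                   are treated as zero counts
    """
    return gini([cell_counts.get(gene, 0) for gene in gene_set])


# Illustrative heat shock genes (assumed for the example only).
HEAT_SHOCK = ["HSPA1A", "HSPA1B", "DNAJB1", "HSPH1"]

# Low, even expression across the set.
calm_cell = {"HSPA1A": 5, "HSPA1B": 5, "DNAJB1": 5, "HSPH1": 5, "ACTB": 200}
# One heat shock gene dominates the set.
stressed_cell = {"HSPA1A": 40, "ACTB": 200}

calm_index = gene_set_gini(calm_cell, HEAT_SHOCK)
stressed_index = gene_set_gini(stressed_cell, HEAT_SHOCK)
```

Note that the Gini coefficient is scale-invariant: it captures how unevenly expression is distributed within the gene set, not how much of it there is. Judging whether a pathway is also elevated requires pairing the index with a magnitude measure, such as the share of total counts the gene set accounts for.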
How these metrics compare to Seurat and Scanpy QC
Most scRNA-seq researchers perform quality control using Seurat (R) or Scanpy (Python). These frameworks provide essential QC capabilities, but their default workflows focus on a subset of the metrics described above.
Covered by Seurat and Scanpy defaults:
- Mitochondrial % filtering - Seurat: PercentageFeatureSet. Scanpy: sc.pp.calculate_qc_metrics. Unobio adds cell-type-aware thresholds.
Require manual setup or separate packages:
- Ribosomal % filtering - Both Seurat and Scanpy require users to specify gene lists manually. Built-in with Unobio.
- Hemoglobin contamination - Not included by default in either framework. Built-in with Unobio.
- Doublet detection - Seurat uses DoubletFinder (separate package). Scanpy uses Scrublet (separate package). Integrated in Unobio.
Not included in Seurat or Scanpy:
- Housekeeping stability - Not included. Built-in with Unobio.
- Shannon entropy - Not included by default. Built-in with Unobio.
- Signal-to-noise ratio - Not included. Built-in with Unobio.
- Gini-based indices (apoptosis, stress, enzyme, heat shock) - Not included. Built-in with Unobio.
- Composite quality score - Not available. Unobio provides a single score per cell that enables cross-dataset ranking.
Note: Seurat and Scanpy are powerful, flexible frameworks that enable researchers to implement any of these metrics with custom code. Unobio's contribution is not to replace these tools but to provide a standardized, pre-built QC layer that produces comparable scores across datasets without requiring each research group to implement and calibrate its own QC pipeline.
Key takeaways
- Single-cell RNA-seq data quality varies significantly across datasets, but the field lacks a unified QC standard that goes beyond basic filtering on gene count, UMI count, and mitochondrial percentage.
- The 12 metrics above provide a comprehensive quality profile that captures technical artifacts (doublets, dissociation effects, temperature stress) alongside fundamental data quality (signal-to-noise ratio, expression diversity).
- Unobio applies these metrics as a free, standardized QC pipeline. Upload your scRNA-seq data to receive processed, quality-scored datasets that are comparable across experiments and institutions.
- Community input drives metric evolution. If your research has identified QC dimensions not covered above, we welcome contributions to the scoring framework.
Want to see standardized QC scores on your scRNA-seq data? Explore Unobio's benchmarks and quality scoring. These quality metrics also feed into Unobio's Clinical Trial Search, enabling researchers to discover and filter single-cell datasets based on data quality alongside biological relevance.


