Single Cell RNA-seq Processing Metrics

March 6, 2025 · 9 min read

We believe in creating community-wide standards for the evaluation and curation of biological data. To that end, we are open sourcing the quality control processing metrics for single-cell RNA sequencing (scRNA-seq) data that make up our Unobio scoring algorithm. This post explains what we chose, why we chose it, and how we believe these evaluation metrics can help researchers quickly evaluate single-cell sequencing data.

These quality control metrics and the associated processing are free of charge to all researchers using the Unobio platform. Simply upload your single-cell RNA sequencing data and you will receive the processed and quality-controlled datasets.

We will follow up by open sourcing our code and then the scoring algorithm itself. We want the metrics presented here to have time for community feedback before we release further artifacts.

The metrics with their sources are as follows:

  • Housekeeping Stability Index: https://pmc.ncbi.nlm.nih.gov/articles/PMC6748759/
  • Mitochondrial RNA Percentage Counts: https://hbctraining.github.io/scRNA-seq/lessons/04_SC_quality_control.html
  • Ribosomal RNA Percentage Counts: https://hbctraining.github.io/scRNA-seq/lessons/04_SC_quality_control.html
  • Hemoglobin Percentage: https://pmc.ncbi.nlm.nih.gov/articles/PMC9716519/
  • Shannon Entropy: https://www.nature.com/articles/ncomms15599
  • Doublet Detection: https://github.com/JonathanShor/DoubletDetection
  • Signal to Noise Ratio: https://academic.oup.com/nar/article/49/14/e83/6291165
  • Gini Coefficient: https://www.pnas.org/doi/10.1073/pnas.1721085115
  • Gini Apoptosis Index: https://www.pnas.org/doi/10.1073/pnas.1721085115
  • Gini Cell Stress Index: https://www.pnas.org/doi/10.1073/pnas.1721085115
  • Gini Enzyme Induced Index: https://www.pnas.org/doi/10.1073/pnas.1721085115
  • Gini Heat Shock Index: https://www.pnas.org/doi/10.1073/pnas.1721085115

These readouts form the basis of the data preprocessing and curation underlying the Unobio scores, which provide a way to filter and rank single-cell sequencing datasets.

Understanding Our Quality Metrics

1. Housekeeping Stability Index

What it measures: The stability of expression across a curated set of housekeeping genes that are expected to be expressed at similar levels across cell types.

Why we use it: Housekeeping genes are expected to show stable expression regardless of cell type or condition. Significant variability in these genes indicates fundamental technical issues with RNA capture, library preparation, or sequencing.

Implications for interpretation: High variability in housekeeping gene expression suggests technical problems that likely affect all gene expression measurements, not just the housekeeping genes themselves. Cells with unstable housekeeping expression provide unreliable data for downstream analyses such as differential expression or trajectory inference. This metric serves as a foundational quality indicator; if housekeeping genes show irregular expression, even specialized gene signatures may be compromised.
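
The exact formula behind the index is not given here, so the following is only a rough sketch of one way to quantify stability. It assumes a coefficient of variation (standard deviation divided by mean) per housekeeping gene across cells. The function name housekeeping_cv and the two-gene list are illustrative and are not Unobio's curated set.

```python
from statistics import mean, pstdev

def housekeeping_cv(cells, housekeeping_genes):
    """Per-gene coefficient of variation (std / mean) across cells.

    cells: list of dicts mapping gene name -> normalized expression.
    Lower values mean more stable expression.
    ASSUMPTION: CV is used as the stability measure; the source does not
    specify the actual formula.
    """
    cv = {}
    for gene in housekeeping_genes:
        values = [cell.get(gene, 0.0) for cell in cells]
        mu = mean(values)
        # A gene that is never detected has no defined CV.
        cv[gene] = pstdev(values) / mu if mu > 0 else float("nan")
    return cv

# Toy data: ACTB is perfectly stable, GAPDH varies between cells.
cells = [
    {"ACTB": 10.0, "GAPDH": 8.0},
    {"ACTB": 10.0, "GAPDH": 2.0},
    {"ACTB": 10.0, "GAPDH": 5.0},
]
cv = housekeeping_cv(cells, ["ACTB", "GAPDH"])
```

In this toy example the CV for ACTB is zero while GAPDH has a clearly larger value, which is the kind of contrast a stability index is meant to surface.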

2. Mitochondrial RNA Percentage Counts

What it measures: The percentage of total transcripts mapping to mitochondrial genes.

Why we use it: Elevated mitochondrial content is a well-established indicator of cellular stress or death. When cells begin to lyse, cytoplasmic RNA is lost while mitochondrial RNA (which is protected by the mitochondrial membrane) remains intact, resulting in artificially high mitochondrial percentages.

Implications for interpretation: Cells with high mitochondrial percentages often represent dying or damaged cells whose transcriptional profiles no longer reflect their normal physiological state. Including these cells can introduce stress-response signatures into what should be healthy cell populations. However, some cell types naturally express higher levels of mitochondrial genes (e.g., cardiomyocytes, hepatocytes), so thresholds should be cell-type aware. Datasets with generally high mitochondrial percentages may indicate poor sample handling or processing.
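
As a minimal sketch of the calculation, assuming human gene symbols where mitochondrial genes carry the conventional "MT-" prefix: the helper names and the cutoff values below are placeholders chosen for illustration, not thresholds used by Unobio.

```python
def percent_counts(cell_counts, is_target):
    """Percentage of a cell's total counts that come from target genes."""
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0
    target = sum(c for gene, c in cell_counts.items() if is_target(gene))
    return 100.0 * target / total

def is_mito(gene):
    # Human mitochondrial genes are conventionally prefixed "MT-" (mouse: "mt-").
    return gene.upper().startswith("MT-")

# HYPOTHETICAL cell-type-aware cutoffs in percent; tune these per tissue.
MAX_MITO_PCT = {"default": 10.0, "cardiomyocyte": 30.0}

def passes_mito_filter(pct, cell_type, cutoffs=MAX_MITO_PCT):
    """True if the mitochondrial percentage is within the cell type's cutoff."""
    return pct <= cutoffs.get(cell_type, cutoffs["default"])

cell = {"MT-CO1": 30, "MT-ND1": 20, "ACTB": 100, "CD3E": 50}
mito_pct = percent_counts(cell, is_mito)  # 50 of 200 counts
```

The same percent_counts helper applies to any gene family by swapping the predicate, for example one matching ribosomal protein or hemoglobin gene symbols.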

3. Ribosomal RNA Percentage Counts

What it measures: The percentage of transcripts mapping to ribosomal protein genes.

Why we use it: Abnormal ribosomal percentages can indicate stress responses or technical biases in RNA capture. While ribosomal transcripts are generally abundant in healthy cells, extreme values (either high or low) often signal quality issues.

Implications for interpretation: Unusually high ribosomal percentages may indicate stress responses that shift cellular resources toward protein synthesis machinery. Conversely, unusually low percentages might indicate selective loss of highly expressed transcripts during library preparation. Both scenarios potentially distort the cell's true transcriptional state. Ribosomal percentage should be evaluated in conjunction with other metrics for comprehensive quality assessment.
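
Because both high and low values are of interest, a two-sided outlier rule is a natural fit. The sketch below assumes a median absolute deviation (MAD) rule with a cutoff of 3 MADs. Both the rule and the cutoff are illustrative assumptions rather than the thresholds used in the Unobio pipeline.

```python
from statistics import median

def mad_outliers(values, n_mads=3.0):
    """Flag values more than n_mads MADs above or below the median.

    ASSUMPTION: an unscaled MAD rule; the source does not specify how
    extreme ribosomal percentages are identified.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread at all: nothing can be called an outlier.
        return [False] * len(values)
    return [abs(v - med) > n_mads * mad for v in values]

# Toy ribosomal percentages for seven cells; the last two are extreme.
ribo_pct = [18.0, 20.0, 22.0, 19.0, 21.0, 60.0, 1.0]
flags = mad_outliers(ribo_pct)
```

Here the median is 20 and the MAD is 2, so only the cells at 60% and 1% fall outside the 3-MAD band and are flagged.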

4. Hemoglobin Percentage

What it measures: The proportion of total transcripts mapping to hemoglobin genes (typically HBA and HBB gene families).

Why we use it: Elevated hemoglobin gene expression often indicates red blood cell contamination in non-blood samples or incomplete red blood cell lysis in blood samples.

Implications for interpretation: High hemoglobin percentages can skew normalization and downstream analyses, particularly in samples where red blood cells should be absent or rare. This contamination can artificially introduce blood-associated signatures into other cell types, potentially leading to misinterpretation of results. This metric is especially important in tissue samples with vascularization, where blood cells may be inadvertently captured along with the tissue of interest.

5. Shannon Entropy

What it measures: Shannon entropy quantifies the diversity and randomness in gene expression patterns within a cell.

Why we use it: Entropy serves as an indicator of transcriptional complexity. In high-quality cells, gene expression follows predictable patterns with moderate entropy values. Extremely low entropy suggests that too few genes are expressed (potentially indicating a low-quality or dying cell), while unusually high entropy often indicates technical noise.

Implications for interpretation: Cells with entropy values outside the expected range for their cell type should be scrutinized carefully. Low entropy cells might represent specialized cells with limited gene expression programs, but more often indicate poor-quality libraries. High entropy cells might represent technical artifacts rather than true biological diversity, potentially leading to spurious cell type identification. Entropy provides a different perspective on expression distribution from metrics like the Gini coefficient, with both offering complementary insights.
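
For reference, Shannon entropy over a cell's counts is H = -sum(p_i * log2(p_i)), where p_i is the fraction of the cell's transcripts assigned to gene i. The following is a small sketch of that formula. Whether the Unobio pipeline uses raw or normalized counts, or a different logarithm base, is not stated here, so treat those choices as assumptions.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of one cell's per-gene counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:  # zero-count genes contribute nothing
            p = c / total
            entropy -= p * math.log2(p)
    return entropy

uniform = shannon_entropy([5, 5, 5, 5])     # counts spread evenly over 4 genes
dominated = shannon_entropy([97, 1, 1, 1])  # one gene takes nearly all counts
```

Evenly spread counts over four genes give the maximum of 2 bits, while the cell dominated by a single gene scores far lower.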

6. Doublet Detection

What it measures: Whether a captured profile is a potential doublet, meaning two or more cells that were inadvertently captured and sequenced as a single cell, as identified by computational methods.

Why we use it: Doublets are unavoidable in most scRNA-seq protocols, with rates typically increasing with cell loading density. They create artificial "hybrid" transcriptional profiles that can be misinterpreted as intermediate cell states or novel cell types.

Implications for interpretation: Undetected doublets can severely confound lineage inference, pseudotime analysis, and cell type identification. They create apparent transitional states or false cell types that don't exist in vivo. By implementing robust doublet detection, researchers can maintain the fidelity of single-cell resolution and avoid erroneous biological conclusions. Various computational approaches exist, with different sensitivity and specificity profiles, so careful implementation is essential.

7. Signal to Noise Ratio

What it measures: The ratio of biological signal to technical noise in scRNA-seq data.

Why we use it: A higher signal-to-noise ratio (SNR) indicates more reliable data where biological variation dominates over technical variation. This metric helps ensure that downstream analyses capture meaningful biological differences rather than technical artifacts.

Implications for interpretation: Datasets or cells with low SNR may lead to spurious identification of differentially expressed genes or cellular states that reflect technical variation rather than biology. Conversely, high SNR datasets provide greater confidence in subtle biological signals, enabling the detection of rare cell types and transitional states. SNR can be calculated at different levels (gene, cell, cluster) to evaluate quality from multiple perspectives.
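
The formula is not specified here, so the sketch below shows only one possible gene-level formulation. It assumes that "signal" is the variance of cluster means and "noise" is the average within-cluster variance. The function name gene_snr and the cluster labels are hypothetical.

```python
from statistics import mean, pvariance

def gene_snr(expr_by_cluster):
    """Between-cluster variance divided by mean within-cluster variance.

    expr_by_cluster: dict mapping cluster label -> list of expression
    values for one gene.
    ASSUMPTION: this variance ratio stands in for SNR; the source does
    not define the calculation.
    """
    cluster_means = [mean(v) for v in expr_by_cluster.values()]
    signal = pvariance(cluster_means)
    noise = mean([pvariance(v) for v in expr_by_cluster.values()])
    return signal / noise if noise > 0 else float("inf")

# A marker-like gene: clusters differ, cells within a cluster agree.
marker = {"T": [9.0, 10.0, 11.0], "B": [0.0, 1.0, 2.0]}
# A noisy gene: cluster means are identical, cells within a cluster scatter.
noisy = {"T": [0.0, 10.0, 5.0], "B": [4.0, 6.0, 5.0]}

snr_marker = gene_snr(marker)
snr_noisy = gene_snr(noisy)
```

The marker-like gene scores well above 1, while the gene whose cluster means coincide scores 0, matching the intuition that only the former carries biological signal.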

8. Gini Coefficient

What it measures: The inequality of overall gene expression distribution within a cell.

Why we use it: A high Gini coefficient often indicates poor-quality cells or technical artifacts where a few genes account for most of the transcripts. By measuring expression inequality, we can identify cells that deviate from expected biological patterns.

Implications for interpretation: Cells with unusually high Gini coefficients typically show dominant expression of a few genes, which may not reflect their true physiological state. Including these cells can introduce artificial heterogeneity in downstream analyses. The Gini coefficient provides a statistical framework for evaluating expression distribution that complements other metrics and serves as the foundation for the specialized Gini indices described below.
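
A minimal sketch of the standard Gini calculation on one cell's per-gene counts follows. It uses the sorted-rank form G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with values sorted in ascending order and i starting at 1. Any preprocessing applied before this step in the Unobio pipeline is not specified here.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

even = gini([1, 1, 1, 1])     # every gene contributes equally
skewed = gini([0, 0, 0, 10])  # a single gene accounts for all counts
```

With n genes the maximum is (n - 1) / n, so the fully skewed four-gene example reaches 0.75 while the even one is 0.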

9. Gini Apoptosis Index

What it measures: Expression inequality among apoptosis-related genes.

Why we use it: Cells undergoing programmed cell death show characteristic gene expression patterns where specific apoptotic markers become highly upregulated.

Implications for interpretation: A high Gini coefficient for apoptosis genes indicates that a few apoptotic markers dominate the expression profile, suggesting the cell was dying during sample preparation. These dying cells often introduce technical noise and biological artifacts that can confound analyses of normal cellular states, particularly in processes like differentiation that can superficially resemble apoptosis. Identifying and removing these cells prevents contamination of biological signals with death-related signatures.

10. Gini Cell Stress Index

What it measures: Expression inequality among cell stress response genes.

Why we use it: Cells experiencing stress from dissociation protocols, FACS, or extended processing times upregulate specific stress response pathways.

Implications for interpretation: A high stress index suggests that the cell was experiencing significant stress during collection or preparation, potentially altering its transcriptional profile away from its natural state in the tissue. This can lead to misinterpretation of cellular states and artificial clustering of cells based on shared stress responses rather than biological identity. By identifying stressed cells, researchers can account for technical stress signatures in their analyses or remove severely affected cells.

11. Gini Enzyme Induced Index

What it measures: Expression inequality among genes known to be artificially induced during enzymatic dissociation of tissues.

Why we use it: Enzymatic tissue dissociation methods (using collagenase, trypsin, etc.) can trigger specific gene expression responses that don't reflect the original in vivo state.

Implications for interpretation: Cells with high enzyme-induced index values were strongly affected by the dissociation process, potentially introducing technical artifacts that mask their true biological state. This metric is particularly important when comparing datasets generated using different dissociation protocols. Understanding dissociation-induced signatures allows researchers to distinguish technical variation from true biological differences, especially in cross-dataset integration.

12. Gini Heat Shock Index

What it measures: Expression inequality among heat shock protein genes and related thermal stress response genes.

Why we use it: Expression of heat shock protein genes is highly sensitive to temperature fluctuations during sample handling.

Implications for interpretation: Elevated and uneven expression of heat shock genes (high Gini coefficient) indicates temperature stress during sample processing. These temperature-stressed cells may exhibit transcriptional changes that misrepresent their original biological state, potentially leading to spurious cell type identification or incorrect inference of cellular processes. This metric helps identify batch effects related to sample handling temperatures, which can be particularly important in multi-center studies or when comparing samples processed at different times.
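
The four gene-set indices above (apoptosis, cell stress, enzyme induced, heat shock) share one computation: a Gini coefficient restricted to a gene list. The sketch below illustrates that idea. The gene lists are short placeholders for illustration only and are not Unobio's curated sets, and gene_set_gini is a hypothetical helper name.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

def gene_set_gini(cell_counts, gene_set):
    """Gini coefficient over one gene set; undetected genes count as zero."""
    return gini([cell_counts.get(gene, 0) for gene in gene_set])

# PLACEHOLDER gene sets for illustration, not curated lists.
GENE_SETS = {
    "heat_shock": ["HSPA1A", "HSPA1B", "HSPA8", "DNAJB1"],
    "apoptosis": ["BAX", "CASP3", "BCL2", "TP53"],
}

cell = {"HSPA1A": 40, "BAX": 5, "CASP3": 5, "BCL2": 5, "TP53": 5, "ACTB": 100}
indices = {name: gene_set_gini(cell, genes) for name, genes in GENE_SETS.items()}
```

In this toy cell a single heat shock gene dominates its set, giving 0.75, while the apoptosis genes are expressed evenly, giving 0. Note that this sketch also returns 0 when no gene in the set is detected, so a real pipeline would likely want to distinguish "evenly expressed" from "not expressed".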

Follow Up

We are using these metrics as a base at Unobio to create processed and curated single-cell RNA sequencing datasets.

We are looking for community feedback on these metrics, including which additional metrics you would like to see and whether anything is missing. We plan to release the code and metadata labels once we have this feedback in place.

We are also looking for individuals who would like to test our pipeline by contributing their single-cell RNA sequencing datasets and seeing how they are processed.