Technical · MAY 22, 2025 · 12 min read

Genomics Quality Control: WGS QC Metrics and Standards.

Genomics quality control demands standardized metrics. This guide covers 20 WGS QC metrics across FASTQ, BAM, and genomic context, with scoring rationale.

Filipp Kramer, Unobio · Research Team

TECHNICAL · SCIENCE
RELEASE 2026.03


Essential metadata and QC metrics for whole genome sequencing

Standardizing metadata and quality control (QC) for genomic data is crucial for ensuring data reliability, reproducibility, and comparability across studies. Without consistent standards, metadata leakage propagates across experiments and erodes the interpretability of downstream analyses. Whole Genome Sequencing (WGS) data demands rigorous QC before biological findings can be interpreted accurately. At Unobio, we define a set of metadata standards and QC metrics tailored specifically to WGS data, so that datasets from across biology can be harmonized, connected, and filtered for noise.

Quick reference: whole genome sequencing QC metrics

Here is a summarized list of the key metrics Unobio pipelines use for evaluating the quality of Whole Genome Sequencing data:

I. Raw Read Metrics (FASTQ File Analysis)

Total Reads
GC Content
Adapter Contamination
Read Length Distribution
Quality Score Distribution (Per-Base Sequence Quality)
Kmer Content Bias

II. Alignment Metrics (BAM File Analysis)

Mapped/Unmapped Reads
Duplicate Reads
Insert Size Distribution
Coverage Depth (Mean)
Base Mismatch Rate
Clipped Reads
BAM Indexing Status
Strand Bias
Chimera Detection

III. Genomic Context & Specificity Metrics

Telomere Integrity (Average Depth)
STR Locus QC (Average Depth & Score)
Chromosome Arm Coverage
Stability Locus QC
mtDNA QC

How each metric contributes to data reliability

The following metrics are categorized by the stage of data processing they primarily assess: raw reads (FASTQ), alignment (BAM), and genomic characteristics. Each metric includes a description of its utility and the rationale behind its inclusion.

I. Raw Read Metrics (FASTQ File Analysis)

These metrics provide an initial assessment of the quality and characteristics of the sequencing run directly from the raw data.

1. Total Reads:

Description: The total number of sequencing reads obtained from the FASTQ file.
Rationale: This is a fundamental metric indicating the overall sequencing output. A low number of total reads can suggest issues during library preparation, sequencing, or data handling, potentially leading to insufficient coverage for downstream analysis. The pipeline scales the score with read count, so samples with more reads score higher.

2. GC Content:

Description: The percentage of Guanine (G) and Cytosine (C) bases in the sequencing reads.
Rationale: The GC content of a genome can vary, but significant deviations from the expected range (e.g., 40-60% for human genomes) can indicate contamination, amplification bias, or sequencing artifacts. This metric is critical for identifying potential biases that could affect variant calling and coverage uniformity. The pipeline assigns higher scores to samples with GC content closer to the ideal human range. For single-cell sequencing data, GC content evaluation requires additional cell-level considerations - see our guide to scRNA-seq processing metrics for the single-cell perspective.
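
The GC check described above can be sketched in a few lines of Python. This is a minimal illustration, not Unobio's implementation: the function names are hypothetical, the 40-60% window is taken from the text, and ambiguous bases (N) are assumed to be excluded from the denominator.

```python
def gc_percent(reads):
    """Return the percentage of G and C bases across all reads.

    Ambiguous bases such as N are ignored (an assumption of this sketch).
    """
    gc = 0
    total = 0
    for seq in reads:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / total if total else 0.0


def in_expected_range(gc, low=40.0, high=60.0):
    """True when GC content falls inside the expected window from the text."""
    return low <= gc <= high
```

For example, `gc_percent(["GGCC", "AATT"])` gives 50.0, which `in_expected_range` accepts.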

3. Adapter Contamination:

Description: The percentage of sequencing reads that contain adapter sequences, which are remnants of the library preparation process.
Rationale: High adapter contamination indicates that a significant portion of the sequencing effort was spent on non-genomic material. This reduces the effective sequencing depth and can lead to false positives in downstream analyses. Low adapter contamination is a hallmark of high-quality library preparation and sequencing. The pipeline penalizes higher contamination percentages, so a higher score means less residual adapter sequence in the reads.
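
As a rough sketch of how an adapter percentage could be computed, the example below counts reads that contain an adapter sequence or end with a partial copy of it. The adapter string is the commonly used Illumina TruSeq prefix and is an assumption here, as are the function name and the `min_overlap` cutoff. Dedicated trimmers also tolerate mismatches, which this sketch does not.

```python
ADAPTER = "AGATCGGAAGAGC"  # assumed: common Illumina TruSeq adapter prefix


def adapter_percent(reads, adapter=ADAPTER, min_overlap=8):
    """Percentage of reads with a full adapter, or a partial one at the 3' end."""

    def has_adapter(seq):
        if adapter in seq:
            return True
        # A read may run into the adapter and stop partway through it,
        # so also accept an adapter prefix of at least min_overlap bases
        # sitting at the very end of the read.
        for k in range(len(adapter) - 1, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                return True
        return False

    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(has_adapter(r.upper()) for r in reads)
    return 100.0 * hits / len(reads)
```

Exact matching keeps the example short; it will undercount adapters that carry sequencing errors.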

4. Read Length Distribution:

Description: Characterizes the lengths of the sequencing reads, including minimum, maximum, mean, and a score for length variation.
Rationale: Consistent read lengths are indicative of high-quality library preparation and sequencing. Significant deviations (e.g., very short reads, highly variable lengths) can point to DNA degradation or issues with the sequencing platform.
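
A read length summary of this kind could look like the sketch below. The coefficient of variation is used here as a stand-in for the "score for length variation"; the article does not state how the pipeline actually scores variation, so treat that field as an assumption.

```python
from statistics import mean, pstdev


def read_length_summary(reads):
    """Summarize read lengths: min, max, mean, and coefficient of variation."""
    lengths = [len(r) for r in reads]
    if not lengths:
        raise ValueError("no reads supplied")
    mu = mean(lengths)
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": mu,
        # Population standard deviation relative to the mean; 0.0 means
        # every read has the same length.
        "cv": pstdev(lengths) / mu if mu else 0.0,
    }
```

A fixed-length Illumina run would typically show a `cv` near zero before trimming.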

5. Quality Score Distribution (Per-Base Sequence Quality):

Description: Assesses the distribution of Phred quality scores across all bases in the reads, often represented by the average quality score per base or overall average.
Rationale: Phred quality scores encode the probability that a base call is incorrect: Q20 corresponds to a 1% error probability and Q30 to 0.1%, the two thresholds most often reported. High average quality scores and a narrow distribution indicate reliable base calls. Poor quality can significantly impact variant discovery and genotyping accuracy. The pipeline scores based on the mean quality, favoring higher scores.
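
The relationship between Phred scores and error probability is Q = -10 · log10(p). The sketch below decodes FASTQ quality strings and summarizes them, assuming the Phred+33 encoding used by current Illumina output. The function names and the choice of summary fields are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def quality_summary(quality_strings, threshold=30):
    """Mean Phred score and percentage of bases at or above the threshold."""
    scores = [q for s in quality_strings for q in phred_scores(s)]
    if not scores:
        return {"mean_q": 0.0, "pct_ge_threshold": 0.0}
    passing = sum(1 for q in scores if q >= threshold)
    return {
        "mean_q": sum(scores) / len(scores),
        "pct_ge_threshold": 100.0 * passing / len(scores),
    }
```

Note that averaging Phred scores directly is a convention rather than a true mean error rate; averaging the error probabilities first would weight low-quality bases more heavily.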

6. Kmer Content Bias:

Description: Measures the frequency of short nucleotide sequences (k-mers) within the reads and calculates a Shannon diversity index.
Rationale: An unbiased k-mer distribution is expected for random genomic fragmentation. Significant overrepresentation or underrepresentation of specific k-mers can indicate sequence-specific biases during library preparation or sequencing, or the presence of contaminants. The Shannon diversity index reflects the richness and evenness of k-mer frequencies, with higher diversity being desirable.
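
A Shannon diversity index over k-mer frequencies can be computed as H = -Σ p_i · log2(p_i), where p_i is the relative frequency of the i-th k-mer. The sketch below is one way to do it. The choice of log base 2, the skipping of k-mers that contain N, and the function names are assumptions of this example.

```python
import math
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer in the reads, skipping k-mers with N."""
    counts = Counter()
    for seq in reads:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def shannon_diversity(counts):
    """Shannon diversity (in bits) of a k-mer count table."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

With this definition the maximum for k-mers over A, C, G, T is 2k bits, reached when all 4^k k-mers are equally frequent.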

II. Alignment Metrics (BAM File Analysis)

These metrics evaluate how well the raw reads have been mapped to a reference genome, providing insights into sample quality, contamination, and alignment efficiency.

7. Mapped/Unmapped Reads:

Description: The percentage of total reads that successfully align to the reference genome and the percentage that do not.
Rationale: A high percentage of mapped reads (typically >90-95% for WGS) indicates good sample quality and effective sequencing. A low mapping rate can suggest contamination (e.g., bacterial DNA in human samples), poor reference genome choice, or highly degraded DNA. The pipeline gives higher scores for higher mapping percentages.
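
Mapping status is recorded in the SAM FLAG field, where bit 0x4 marks an unmapped read. The sketch below computes a mapping rate from a list of integer flags. Counting only primary records (skipping secondary and supplementary alignments) is an assumption of this example, made so that each read is counted once.

```python
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mapping_rate(flags):
    """Percentage of primary records whose unmapped bit is not set."""
    # Secondary and supplementary records describe extra alignments of a
    # read that is already counted, so they are excluded here.
    primary = [f for f in flags if not f & (SECONDARY | SUPPLEMENTARY)]
    if not primary:
        return 0.0
    mapped = sum(1 for f in primary if not f & UNMAPPED)
    return 100.0 * mapped / len(primary)
```

The flags themselves would come from the second column of each SAM record.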

8. Duplicate Reads:

Description: The percentage of reads that are identical and map to the same genomic location, likely originating from PCR amplification bias during library preparation.
Rationale: High duplication rates reduce the effective sequencing depth, as redundant reads provide no new information. While some duplicates are expected, excessive duplication can be a sign of low input DNA, over-amplification, or an unbalanced library. The pipeline penalizes higher duplicate percentages, so a higher Unobio score means a lower percentage of duplicates.
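
Once a duplicate-marking tool has run, duplicates carry bit 0x400 in the SAM FLAG field. The sketch below derives a duplicate percentage from integer flags. It assumes duplicates were already marked upstream and restricts the denominator to mapped primary records; both are choices of this example.

```python
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def duplicate_percent(flags):
    """Percentage of mapped primary records flagged as duplicates."""
    skip = UNMAPPED | SECONDARY | SUPPLEMENTARY
    considered = [f for f in flags if not f & skip]
    if not considered:
        return 0.0
    duplicates = sum(1 for f in considered if f & DUPLICATE)
    return 100.0 * duplicates / len(considered)
```

If the BAM has not been through duplicate marking, this returns 0.0 regardless of the true duplication level.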

9. Insert Size Distribution:

Description: The distribution (mean, median, standard deviation) of the distances between paired-end reads on the reference genome.
Rationale: The insert size is a critical parameter for paired-end sequencing. Deviations from the expected insert size distribution (e.g., very short or highly variable inserts) can indicate issues with fragmentation, size selection, or adapter ligation during library preparation. This metric also aids in detecting structural variations.
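
Insert size statistics are usually derived from the TLEN field of paired alignments. The sketch below summarizes a list of TLEN values. Keeping only positive values (so each pair is counted once) and discarding values above a hypothetical `max_insert` cutoff (to drop pairs that span structural variants or misalignments) are assumptions of this example.

```python
from statistics import mean, median, pstdev


def insert_size_summary(tlens, max_insert=10_000):
    """Mean, median, and standard deviation of plausible insert sizes."""
    # TLEN is positive for the leftmost mate and negative for its partner,
    # so the positive values cover each pair exactly once.
    sizes = [t for t in tlens if 0 < t <= max_insert]
    if not sizes:
        return None
    return {
        "n_pairs": len(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
        "stdev": pstdev(sizes),
    }
```

A large gap between mean and median is a quick hint that the distribution has a long tail.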

10. Coverage Depth (Mean):

Description: The average number of reads that cover each base in the reference genome.
Rationale: Adequate and uniform coverage depth is essential for accurate variant calling. Low coverage increases the chance of missing true variants, especially at heterozygous sites, where only about half of the reads carry the variant allele. The pipeline rewards higher coverage depths, with ideal scores for depths sufficient for robust variant calling (e.g., 30x or higher).
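
Mean coverage follows from simple arithmetic: total aligned bases divided by the length of the reference. The sketch below shows that calculation and the inverse question of how many reads a target depth requires. The 3.1 Gb genome length in the usage note is an approximate figure for the human reference and is an assumption of this example.

```python
def mean_coverage(aligned_bases, genome_length):
    """Mean depth: total aligned bases divided by reference length."""
    return aligned_bases / genome_length


def reads_needed(target_depth, read_length, genome_length):
    """Smallest number of reads that reaches the target mean depth."""
    total_bases = target_depth * genome_length
    # Ceiling division with integers, so the result never falls short.
    return -(-total_bases // read_length)
```

For instance, `reads_needed(30, 150, 3_100_000_000)` gives 620,000,000 reads of 150 bp for 30x over a 3.1 Gb reference, before accounting for unmapped or duplicate reads.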

11. Base Mismatch Rate:

Description: The percentage of aligned bases that do not match the reference genome.
Rationale: While some mismatches are expected due to true genetic variation, a high mismatch rate can indicate low sequencing quality (e.g., high error rate), contamination, or issues with the reference genome. A low, consistent mismatch rate is desirable.

12. Clipped Reads:

Description: The percentage of reads where a portion of the read sequence is "clipped" (excluded from the alignment) because it doesn't align to the reference. This includes soft clipping (bases kept in the read record but not aligned) and hard clipping (bases removed from the read record).
Rationale: High soft clipping percentages can suggest the presence of adapter sequences not trimmed before alignment, or reads spanning structural variants. High hard clipping might indicate highly fragmented DNA or reads with poor quality ends. The pipeline focuses on soft-clipped reads as a key indicator of potential issues.
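
Clipping is visible in the CIGAR string of each alignment, where `S` marks soft-clipped bases and `H` marks hard-clipped bases. The sketch below parses CIGAR strings and reports the percentage of alignments with each kind of clipping. The function names and output fields are illustrative only.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_bases(cigar):
    """Return (soft_clipped, hard_clipped) base counts for one CIGAR string."""
    soft = hard = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op == "S":
            soft += int(length)
        elif op == "H":
            hard += int(length)
    return soft, hard


def clipping_summary(cigars):
    """Percentage of alignments with any soft or any hard clipping."""
    clips = [clipped_bases(c) for c in cigars]
    if not clips:
        return {"pct_soft_clipped": 0.0, "pct_hard_clipped": 0.0}
    n = len(clips)
    return {
        "pct_soft_clipped": 100.0 * sum(1 for s, _ in clips if s > 0) / n,
        "pct_hard_clipped": 100.0 * sum(1 for _, h in clips if h > 0) / n,
    }
```

A stricter variant could require a minimum number of clipped bases before counting a read, since a few clipped bases at read ends are common.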

13. BAM Indexing Status:

Description: A boolean flag indicating whether the BAM file has a corresponding index file (e.g., .bai).
Rationale: BAM indexes are essential for efficient random access to reads within the BAM file, enabling faster downstream tools (e.g., variant callers, genome browsers) to operate. The absence of an index can significantly slow down analysis.
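
A boolean index check can be as simple as looking for the index file next to the BAM. The sketch below checks the common naming conventions (`sample.bam.bai`, `sample.bai`, and the CSI variant `sample.bam.csi`). Which of these a given pipeline accepts is an assumption here, and the sketch does not verify that the index is newer than the BAM.

```python
from pathlib import Path


def has_bam_index(bam_path):
    """True if an index file sits next to the BAM under a common name."""
    bam = Path(bam_path)
    candidates = [
        Path(str(bam) + ".bai"),   # sample.bam.bai
        bam.with_suffix(".bai"),   # sample.bai
        Path(str(bam) + ".csi"),   # sample.bam.csi
    ]
    return any(p.is_file() for p in candidates)
```

Comparing modification times of the BAM and its index would be a natural extension, since a stale index can cause errors in downstream tools.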

14. Strand Bias:

Description: The imbalance in the number of reads mapping to the forward versus reverse strands at specific genomic loci.
Rationale: For most genomic regions, an equal distribution of reads on both strands is expected. Significant strand bias can indicate technical artifacts like amplification bias or issues during library preparation that might lead to inaccurate variant calls. The pipeline assigns higher scores when the forward and reverse read percentages are close to 50%.
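
Read strand is stored in bit 0x10 of the SAM FLAG field. The sketch below computes forward and reverse percentages and a score that peaks at a 50/50 split. The linear scoring formula is hypothetical, since the article only states that values near 50% score higher, and the example works genome-wide rather than per locus for brevity.

```python
UNMAPPED = 0x4
REVERSE = 0x10


def strand_balance(flags):
    """Return (forward_pct, reverse_pct) over mapped records."""
    mapped = [f for f in flags if not f & UNMAPPED]
    if not mapped:
        return 0.0, 0.0
    reverse = sum(1 for f in mapped if f & REVERSE)
    forward_pct = 100.0 * (len(mapped) - reverse) / len(mapped)
    return forward_pct, 100.0 - forward_pct


def strand_score(forward_pct):
    """Hypothetical score: 1.0 at a 50/50 split, 0.0 when one strand is absent."""
    return 1.0 - abs(forward_pct - 50.0) / 50.0
```

Applying the same functions to the reads overlapping a single locus gives the per-locus view described above.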

15. Chimera Detection:

Description: A score derived from supplementary alignments that are often associated with chimeric reads. The pipeline identifies supplementary alignments, sorts chimeric contigs by frequency (with higher frequencies ranked at the top), selects the top 10 chimeras, and calculates a chimera score from these high-frequency contigs.
Rationale: Supplementary alignments can indicate chimeric reads, which are artifacts that can arise from issues during library preparation (e.g., incomplete ligation, template switching) and can lead to false positive structural variant calls. Lower chimera scores are indicative of higher-quality DNA and library construction.
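
The steps in the description (find supplementary alignments, rank contigs by frequency, keep the top 10) can be sketched as follows. Supplementary alignments carry bit 0x800 in the SAM FLAG field. The final score formula is not given in the article, so the ratio used here is purely a placeholder assumption.

```python
from collections import Counter

SUPPLEMENTARY = 0x800


def chimera_summary(records, top_n=10):
    """Rank contigs by supplementary-alignment count and derive a score.

    records: iterable of (flag, contig_name) pairs, one per alignment record.
    """
    records = list(records)
    by_contig = Counter(
        contig for flag, contig in records if flag & SUPPLEMENTARY
    )
    # most_common sorts by count, highest first, matching the ranking step.
    top = by_contig.most_common(top_n)
    # Hypothetical score: share of all records that are supplementary
    # alignments on the top-ranked contigs (lower is better).
    score = sum(count for _, count in top) / len(records) if records else 0.0
    return top, score
```

Any monotonic function of the top-contig counts would serve the same purpose; the ratio is used only because it is easy to interpret.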

III. Genomic Context & Specificity Metrics

These metrics leverage the alignment information to assess specific genomic features or potential issues.

16. Telomere Integrity (Average Depth):

Description: The average sequencing depth specifically over telomeric regions of the chromosomes.
Rationale: Telomeres are repetitive regions at chromosome ends that are often challenging to sequence and map accurately. Consistent and sufficient coverage over telomeres indicates good overall library quality and even sequencing across difficult-to-map regions. Low coverage here might indicate coverage problems in other repetitive regions as well.

17. STR Locus QC (Average Depth & Score):

Description: The average sequencing depth across a set of Short Tandem Repeat (STR) loci, and a calculated quality score for these regions.
Rationale: STRs are highly polymorphic and can be difficult to sequence accurately due to their repetitive nature. Assessing coverage and quality at these loci provides insight into the library's ability to handle challenging regions, which is important for studies involving genotyping or forensics.

18. Chromosome Arm Coverage:

Description: Measures the proportion of mapped reads for each chromosome relative to the total number of reads. The pipeline calculates the number of mapped reads per chromosome, divides this by the total read count to obtain a ratio, and assigns a score based on predefined thresholds where lower ratios receive higher scores.
Rationale: Significant deviations in the proportion of reads mapping to specific chromosomes can indicate large-scale chromosomal abnormalities (e.g., aneuploidy, large deletions/duplications) or biases during sample processing. Consistent read distribution across chromosomes is crucial for copy number variation (CNV) analysis and detecting chromosomal imbalances.
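
The ratio step in the description is straightforward to illustrate. The sketch below divides per-chromosome mapped read counts by the total read count. The input dictionary is hypothetical (such counts can be obtained from a tool like `samtools idxstats`), and the threshold-based scoring is left out because the article does not list the thresholds.

```python
def chromosome_read_ratios(mapped_per_chromosome, total_reads):
    """Ratio of mapped reads on each chromosome to the total read count."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {
        name: count / total_reads
        for name, count in mapped_per_chromosome.items()
    }
```

In practice these ratios are most informative when compared with the share of the genome each chromosome represents, so that a long chromosome is not flagged merely for attracting more reads.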

19. Stability Locus QC:

Description: A metric providing an overall quality score based on coverage or other features at specific "stability" loci, which are often chosen for their known consistent behavior in sequencing.
Rationale: These loci serve as internal controls. Their consistent and high-quality coverage can act as a benchmark for the overall sequencing run, indicating the reliability of the sequencing process.

20. mtDNA QC:

Description: The percentage of reads that map to mitochondrial DNA (mtDNA) and the average coverage depth of the mitochondrial genome.
Rationale: For nuclear WGS, a very low percentage of mtDNA reads is generally desired, as high levels can indicate contamination (e.g., from cell-free DNA in plasma) or a sample with a high mitochondrial copy number relative to nuclear DNA. The acceptable range varies by sample type (e.g., plasma, whole blood, tissue). The pipeline penalizes very high percentages, indicating a potential issue for nuclear WGS.
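
Both mtDNA values can be estimated from per-contig read counts. The sketch below assumes the human mitochondrial reference length of 16,569 bp, a set of common mitochondrial contig names, and a fixed read length. The function name and inputs are hypothetical.

```python
MT_NAMES = {"chrM", "MT", "chrMT", "M"}  # assumed common contig names
MT_LENGTH = 16_569  # human mitochondrial reference length in bp


def mtdna_summary(reads_per_contig, read_length):
    """Percentage of mapped reads on mtDNA and approximate mean mtDNA depth."""
    total = sum(reads_per_contig.values())
    mt_reads = sum(n for name, n in reads_per_contig.items() if name in MT_NAMES)
    return {
        "pct_mt": 100.0 * mt_reads / total if total else 0.0,
        # Depth estimate: mitochondrial bases divided by mitochondrial length.
        "mean_mt_depth": mt_reads * read_length / MT_LENGTH,
    }
```

Because the mitochondrial genome is so small, even a low percentage of reads translates into very high depth, which is why the two values are reported together.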

How Unobio's QC approach compares to standard tools

Researchers working with WGS data typically rely on a combination of open-source tools for quality control. Below is how the metrics above map to established tools and where Unobio's pipeline adds context beyond what individual tools provide.

Raw read QC (FASTQ) - Standard tools: FastQC, MultiQC. These provide per-base quality scores, GC content, adapter contamination, and overrepresented sequences. Unobio adds automated scoring across all FASTQ metrics with a composite quality grade, flagging datasets that pass individual metrics but fail in combination.

Alignment QC (BAM) - Standard tools: SAMtools flagstat, Picard CollectAlignmentSummaryMetrics, Picard MarkDuplicates, Picard CollectInsertSizeMetrics, and Picard CollectWgsMetrics. Together these provide mapping rate, duplicate rate, insert size distribution, and coverage depth. Unobio adds integrated scoring that weighs alignment metrics against expected ranges for the sample type, with strand bias and chimera detection as standard metrics.

Variant-level QC - Standard tools: GATK VariantEval, bcftools stats. These provide Ti/Tv ratio, het/hom ratio, and variant count by type. Unobio's current pipeline focuses on pre-variant QC (read and alignment level). Variant-level metrics are a planned addition.

Genomic context - Standard tools: Custom scripts (typically), covering specific regions like telomeres, STRs, and mtDNA. Unobio adds standardized genomic context metrics (telomere integrity, STR locus QC, chromosome arm coverage, mtDNA QC) built into every pipeline run. These are rarely assessed systematically in standard workflows.

Cross-dataset comparability - No standard tool exists for this. Unobio provides a unified scoring system that enables filtering and ranking datasets across studies, institutions, and sequencing platforms.

Note: FastQC, SAMtools, Picard, and GATK remain essential tools for genomics research. Unobio's QC pipeline is designed to complement these tools by providing an integrated scoring layer that enables cross-dataset comparison and automated quality filtering at scale.

Key takeaways

Standardized QC metrics for WGS data are essential for cross-study comparability, but the field lacks a unified scoring framework that integrates raw read, alignment, and genomic context metrics into a single quality assessment.
The 20 metrics outlined above cover the three critical stages of WGS quality evaluation: raw reads (FASTQ), alignment (BAM), and genomic context. Together, they provide a comprehensive quality profile for any WGS dataset.
Unobio's QC pipeline implements these metrics as a free, community-accessible service. Researchers can upload WGS data and receive standardized quality scores that enable dataset comparison across institutions and sequencing platforms. This quality layer also feeds into Unobio's Clinical Trial Search platform, where researchers can filter and discover genomic datasets based on data quality, not just metadata tags.
These metrics are designed to evolve with community input. If your research group has identified additional QC dimensions that should be incorporated, we welcome the feedback - the goal is a community-defined standard that serves all researchers.

Want to see how standardized QC scoring works on your genomics data? Explore Unobio's benchmarks and data quality tools.
