Standardizing Genomics Quality Controls

Essential Metadata and QC Metrics for Whole Genome Sequencing
Standardizing metadata and quality control (QC) for genomic data is crucial for ensuring data reliability, reproducibility, and comparability across studies. Whole Genome Sequencing (WGS) data demands rigorous QC before biological findings can be interpreted accurately. At Unobio, we outline a set of metadata standards and QC metrics specifically tailored for WGS data, so that datasets from across biology can be harmonized and connected, with noisy data filtered out.
Quick Reference: Whole Genome Sequencing QC Metrics
Here is a summarized list of the key metrics Unobio pipelines use for evaluating the quality of Whole Genome Sequencing data:
I. Raw Read Metrics (FASTQ File Analysis)
- Total Reads
- GC Content
- Adapter Contamination
- Read Length Distribution
- Quality Score Distribution (Per-Base Sequence Quality)
- Kmer Content Bias
II. Alignment Metrics (BAM File Analysis)
- Mapped/Unmapped Reads
- Duplicate Reads
- Insert Size Distribution
- Coverage Depth (Mean)
- Base Mismatch Rate
- Clipped Reads
- BAM Indexing Status
- Strand Bias
- Chimera Detection
III. Genomic Context & Specificity Metrics
- Telomere Integrity (Average Depth)
- STR Locus QC (Average Depth & Score)
- Chromosome Arm Coverage
- Stability Locus QC
- mtDNA QC
Explaining Each Metric and Its Importance
The following metrics are categorized by the stage of data processing they primarily assess: raw reads (FASTQ), alignment (BAM), and genomic characteristics. Each metric includes a description of its utility and the rationale behind its inclusion.
I. Raw Read Metrics (FASTQ File Analysis)
These metrics provide an initial assessment of the quality and characteristics of the sequencing run directly from the raw data.
1. Total Reads:
- Description: The total number of sequencing reads obtained from the FASTQ file.
- Rationale: This is a fundamental metric indicating the overall sequencing output. A low number of total reads can suggest issues during library preparation, sequencing, or data handling, potentially leading to insufficient coverage for downstream analysis. The pipeline scores this metric on a scale, assigning higher scores to samples with more reads.
2. GC Content:
- Description: The percentage of Guanine (G) and Cytosine (C) bases in the sequencing reads.
- Rationale: The GC content of a genome can vary, but significant deviations from the expected range (e.g., 40-60% for human genomes) can indicate contamination, amplification bias, or sequencing artifacts. This metric is critical for identifying potential biases that could affect variant calling and coverage uniformity. The pipeline assigns higher scores to samples with GC content closer to the ideal human range.
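As a rough illustration of how a GC percentage can be computed from read sequences, here is a minimal sketch. The function name and the choice to ignore N and other ambiguity codes are assumptions made for this example, not a description of the Unobio pipeline.

```python
def gc_percent(reads):
    """Return the GC percentage over all A/C/G/T bases in an iterable of reads."""
    gc_bases = 0
    counted_bases = 0
    for seq in reads:
        seq = seq.upper()
        g_or_c = seq.count("G") + seq.count("C")
        a_or_t = seq.count("A") + seq.count("T")
        gc_bases += g_or_c
        # Assumption: N and other ambiguity codes are left out of the denominator.
        counted_bases += g_or_c + a_or_t
    if counted_bases == 0:
        raise ValueError("no A/C/G/T bases found")
    return 100.0 * gc_bases / counted_bases


example_reads = ["ACGT", "GGCC", "ATAT", "NNGC"]
example_gc = gc_percent(example_reads)
```

Whether ambiguous bases belong in the denominator is a design choice; excluding them keeps a run with many N calls from looking artificially AT-rich.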
3. Adapter Contamination:
- Description: The percentage of sequencing reads that contain adapter sequences, which are remnants of the library preparation process.
- Rationale: High adapter contamination indicates that a significant portion of the sequencing effort was spent on non-genomic material. This reduces the effective sequencing depth and can lead to false positives in downstream analyses. Low adapter contamination is a hallmark of high-quality library preparation and sequencing. The pipeline penalizes higher contamination percentages, so a higher score means less adapter sequence remains in the reads.
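The percentage described above can be approximated with a simple substring search. This is only a sketch: the adapter sequence used here is an assumed example (a commonly used Illumina adapter prefix), and the exact matching is a simplification, since dedicated trimming tools also tolerate mismatches and partial matches at read ends.

```python
# Assumption: a commonly used Illumina adapter prefix, for illustration only.
EXAMPLE_ADAPTER = "AGATCGGAAGAGC"


def adapter_percent(reads, adapter=EXAMPLE_ADAPTER):
    """Percentage of reads that contain the adapter as an exact substring."""
    reads = list(reads)
    if not reads:
        raise ValueError("no reads supplied")
    hits = sum(1 for seq in reads if adapter in seq.upper())
    return 100.0 * hits / len(reads)


example_reads = ["ACGTAGATCGGAAGAGCTT", "ACGTACGT", "TTTT", "GGGG"]
example_adapter_pct = adapter_percent(example_reads)
```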
4. Read Length Distribution:
- Description: Characterizes the lengths of the sequencing reads, including minimum, maximum, mean, and a score for length variation.
- Rationale: Consistent read lengths are indicative of high-quality library preparation and sequencing. Significant deviations (e.g., very short reads, highly variable lengths) can point to DNA degradation or issues with the sequencing platform.
5. Quality Score Distribution (Per-Base Sequence Quality):
- Description: Assesses the distribution of Phred quality scores across all bases in the reads, often represented by the average quality score per base or overall average.
- Rationale: A Phred quality score Q encodes the estimated probability that a base call is incorrect: Q20 corresponds to a 1% error probability and Q30 to 0.1%, and these two values are the usual reporting thresholds. High average quality scores and a narrow distribution indicate reliable base calls. Poor quality can significantly impact variant discovery and genotyping accuracy. The pipeline scores based on the mean quality, favoring higher values.
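To make the relationship between Phred scores and error probability concrete, the sketch below converts a score to a probability using the standard definition Q = -10 * log10(p), and computes the mean score of a FASTQ quality line. The Phred+33 encoding is an assumption about the input; the function names are illustrative.

```python
def phred_to_error_prob(q):
    """Convert a Phred score to the estimated probability of a wrong base call."""
    return 10 ** (-q / 10)


def mean_quality(quality_line, offset=33):
    """Arithmetic mean Phred score of one FASTQ quality line.

    Assumption: Phred+33 (Sanger / modern Illumina) encoding.
    """
    if not quality_line:
        raise ValueError("empty quality line")
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores)


q20_error = phred_to_error_prob(20)
q30_error = phred_to_error_prob(30)
example_mean_q = mean_quality("II5+")  # encoded scores 40, 40, 20, 10
```

Note that an arithmetic mean of Phred scores is not the same as the Phred score of the mean error probability; the sketch uses the former because it matches the "mean quality" wording above.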
6. Kmer Content Bias:
- Description: Measures the frequency of short nucleotide sequences (k-mers) within the reads and calculates a Shannon diversity index.
- Rationale: An unbiased k-mer distribution is expected for random genomic fragmentation. Significant overrepresentation or underrepresentation of specific k-mers can indicate sequence-specific biases during library preparation or sequencing, or the presence of contaminants. The Shannon diversity index reflects the richness and evenness of k-mer frequencies, with higher diversity being desirable.
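A Shannon diversity index over k-mer frequencies can be sketched as follows. The value of k, the log base (bits), and the decision to skip k-mers containing non-ACGT characters are assumptions for illustration; the text above does not specify them.

```python
import math
from collections import Counter

VALID_BASES = frozenset("ACGT")


def kmer_counts(reads, k=5):
    """Count every k-mer made only of A/C/G/T across the given reads."""
    counts = Counter()
    for seq in reads:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    return counts


def shannon_diversity(counts):
    """Shannon index H = -sum(p * log2(p)) over k-mer frequencies, in bits."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no k-mers counted")
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


example_counts = kmer_counts(["ACGTN"], k=2)  # AC, CG, GT; TN is skipped
example_h = shannon_diversity(Counter({"AA": 1, "CC": 1, "GG": 1, "TT": 1}))
```

For a given k the index is bounded above by log2(4^k) = 2k bits, reached only when every possible k-mer is equally frequent, so values are comparable across samples only when k is held fixed.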
II. Alignment Metrics (BAM File Analysis)
These metrics evaluate how well the raw reads have been mapped to a reference genome, providing insights into sample quality, contamination, and alignment efficiency.
7. Mapped/Unmapped Reads:
- Description: The percentage of total reads that successfully align to the reference genome and the percentage that do not.
- Rationale: A high percentage of mapped reads (typically above 90-95% for WGS) indicates good sample quality and effective sequencing. A low mapping rate can suggest contamination (e.g., bacterial DNA in human samples), poor reference genome choice, or highly degraded DNA. The pipeline gives higher scores for higher mapping percentages.
8. Duplicate Reads:
- Description: The percentage of reads that map to the same genomic location as another read and are presumed to be copies of a single original fragment, most often produced by PCR amplification during library preparation.
- Rationale: High duplication rates reduce the effective sequencing depth, as redundant reads provide no new information. While some duplicates are expected, excessive duplication can be a sign of low input DNA, over-amplification, or an unbalanced library. The pipeline penalizes higher duplicate percentages, so a higher Unobio score means a lower percentage of duplicates.
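Both the mapped/unmapped and duplicate percentages can be derived from the FLAG field of each alignment record. The sketch below uses the flag bits defined in the SAM specification; which records go into each denominator (primary records only, duplicates as a share of mapped reads) is an assumption made for this example, and duplicates must already have been marked by an upstream tool.

```python
# Flag bits as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800


def mapping_summary(flags):
    """Summarize mapped, unmapped, and duplicate percentages from SAM flags."""
    # Assumption: count each read once by ignoring secondary/supplementary records.
    primary = [f for f in flags if not f & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY)]
    if not primary:
        raise ValueError("no primary records")
    mapped = [f for f in primary if not f & FLAG_UNMAPPED]
    duplicates = [f for f in mapped if f & FLAG_DUPLICATE]
    mapped_pct = 100.0 * len(mapped) / len(primary)
    return {
        "mapped_pct": mapped_pct,
        "unmapped_pct": 100.0 - mapped_pct,
        "duplicate_pct": 100.0 * len(duplicates) / len(mapped) if mapped else 0.0,
    }


# Two mapped mates, two unmapped mates, one marked duplicate, one supplementary.
example_flags = [99, 147, 77, 141, 1123, 2147]
example_summary = mapping_summary(example_flags)
```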
9. Insert Size Distribution:
- Description: The distribution (mean, median, standard deviation) of insert sizes, that is, the lengths of the sequenced DNA fragments as inferred from where the two reads of each pair align on the reference genome.
- Rationale: The insert size is a critical parameter for paired-end sequencing. Deviations from the expected insert size distribution (e.g., very short or highly variable inserts) can indicate issues with fragmentation, size selection, or adapter ligation during library preparation. This metric also aids in detecting structural variations.
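The summary statistics named above can be computed from the signed template length (TLEN) field of the alignment records. This sketch assumes the caller has already restricted the input to properly paired reads; keeping only positive TLEN values counts each pair once, because the two mates carry the same magnitude with opposite signs.

```python
import statistics


def insert_size_stats(template_lengths):
    """Mean, median, and sample standard deviation of positive TLEN values."""
    # A TLEN of 0 means the template length is unavailable; negatives are the mate.
    sizes = [t for t in template_lengths if t > 0]
    if len(sizes) < 2:
        raise ValueError("need at least two positive template lengths")
    return {
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "stdev": statistics.stdev(sizes),
    }


example_tlens = [300, -300, 350, -350, 400, -400, 0]
example_insert_stats = insert_size_stats(example_tlens)
```

Real insert-size distributions often have a long right tail from chimeric or mis-mapped pairs, so production tools usually trim outliers before computing the mean and standard deviation; that step is omitted here.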
10. Coverage Depth (Mean):
- Description: The average number of reads that cover each base in the reference genome.
- Rationale: Adequate and uniform coverage depth is essential for accurate variant calling. Low coverage increases the chance of missing true variants, especially heterozygous ones, where only about half of the reads carry the variant allele. The pipeline rewards higher coverage depths, with ideal scores for depths sufficient for robust variant calling (e.g., 30x or higher).
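Two simple ways to arrive at a mean depth are sketched below: an up-front estimate from read count, read length, and genome size, and a direct average over per-base depths. The human genome size of roughly 3.1 billion bases is an approximation used only for the example, and the function names are illustrative.

```python
def expected_mean_coverage(read_count, read_length, genome_size):
    """Estimate mean depth as total sequenced bases divided by genome size.

    This ignores unmapped reads, duplicates, and clipping, so it is an upper
    bound on the depth actually observed after alignment.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return read_count * read_length / genome_size


def mean_depth(per_base_depths):
    """Average of observed per-base depths, including zero-depth positions."""
    depths = list(per_base_depths)
    if not depths:
        raise ValueError("no positions supplied")
    return sum(depths) / len(depths)


# Assumption: ~3.1 Gb human genome; 620 million reads of 150 bp each
# (for paired-end data, count both mates).
example_estimate = expected_mean_coverage(620_000_000, 150, 3_100_000_000)
example_observed = mean_depth([30, 28, 32, 0])
```

A mean alone hides non-uniformity: the second example averages to 22.5x even though one position has no coverage at all, which is why the rationale above asks for uniform as well as adequate depth.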
11. Base Mismatch Rate:
- Description: The percentage of aligned bases that do not match the reference genome.
- Rationale: While some mismatches are expected due to true genetic variation, a high mismatch rate can indicate low sequencing quality (e.g., high error rate), contamination, or issues with the reference genome. A low, consistent mismatch rate is desirable.
12. Clipped Reads:
- Description: The percentage of reads where a portion of the read sequence is "clipped" (removed) during alignment because it doesn't align to the reference. This includes soft clipping (bases present but not aligned) and hard clipping (bases removed).
- Rationale: High soft clipping percentages can suggest the presence of adapter sequences not trimmed before alignment, or reads spanning structural variants. High hard clipping might indicate highly fragmented DNA or reads with poor quality ends. The pipeline focuses on soft-clipped reads as a key indicator of potential issues.
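Soft and hard clipping are recorded in the CIGAR string of each alignment as the S and H operations. A minimal sketch of the percentage calculation follows; the function name and the choice to skip records without a CIGAR ("*") are assumptions for illustration.

```python
import re

# CIGAR operations as defined in the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clip_summary(cigars):
    """Percentage of aligned records with any soft (S) or hard (H) clipping."""
    # Assumption: records with CIGAR "*" (unavailable) are left out entirely.
    aligned = [c for c in cigars if c != "*"]
    if not aligned:
        raise ValueError("no aligned records")
    soft = 0
    hard = 0
    for cigar in aligned:
        ops = {op for _, op in CIGAR_OP.findall(cigar)}
        soft += "S" in ops
        hard += "H" in ops
    return {
        "soft_clipped_pct": 100.0 * soft / len(aligned),
        "hard_clipped_pct": 100.0 * hard / len(aligned),
    }


example_cigars = ["150M", "20S130M", "100M50S", "60H90M", "*"]
example_clips = clip_summary(example_cigars)
```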
13. BAM Indexing Status:
- Description: A boolean flag indicating whether the BAM file has a corresponding index file (e.g., .bai).
- Rationale: BAM indexes are essential for efficient random access to reads within the BAM file, enabling faster downstream tools (e.g., variant callers, genome browsers) to operate. The absence of an index can significantly slow down analysis.
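A boolean check of this kind can be as simple as looking for an index file next to the BAM. The sketch below checks the common naming patterns; treating a .csi file as an acceptable index is an assumption beyond the .bai example given above, and the check does not verify that the index is up to date with the BAM.

```python
from pathlib import Path


def has_bam_index(bam_path):
    """Return True if an index file sits next to the given BAM file."""
    bam = Path(bam_path)
    candidates = [
        bam.with_name(bam.name + ".bai"),  # sample.bam.bai
        bam.with_suffix(".bai"),           # sample.bai
        bam.with_name(bam.name + ".csi"),  # sample.bam.csi (assumed acceptable)
    ]
    return any(path.is_file() for path in candidates)
```

For example, `has_bam_index("sample.bam")` returns True when `sample.bam.bai` exists in the same directory.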
14. Strand Bias:
- Description: The imbalance in the number of reads mapping to the forward versus reverse strands at specific genomic loci.
- Rationale: For most genomic regions, an equal distribution of reads on both strands is expected. Significant strand bias can indicate technical artifacts like amplification bias or issues during library preparation that might lead to inaccurate variant calls. The pipeline assigns higher scores when the forward and reverse read percentages are close to 50%.
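One way to turn "close to 50%" into a number is a score that equals 1 for a perfect forward/reverse balance and falls linearly to 0 when all reads lie on one strand. This linear form is a hypothetical scoring function chosen for illustration; the actual thresholds used by the pipeline are not stated here.

```python
def strand_balance_score(forward_reads, reverse_reads):
    """Hypothetical score: 1.0 at a 50/50 strand split, 0.0 when one-sided."""
    total = forward_reads + reverse_reads
    if total == 0:
        raise ValueError("no reads at this locus")
    forward_fraction = forward_reads / total
    # Distance from 0.5, rescaled so the maximum possible distance maps to 0.
    return 1.0 - abs(forward_fraction - 0.5) / 0.5


balanced = strand_balance_score(50, 50)
skewed = strand_balance_score(75, 25)
one_sided = strand_balance_score(100, 0)
```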
15. Chimera Detection:
- Description: A score derived from supplementary alignments that are often associated with chimeric reads. The pipeline identifies supplementary alignments, sorts chimeric contigs by frequency (with higher frequencies ranked at the top), selects the top 10 chimeras, and calculates a chimera score from these high-frequency contigs.
- Rationale: Supplementary alignments can indicate chimeric reads, which are artifacts that can arise from issues during library preparation (e.g., incomplete ligation, template switching) and can lead to false positive structural variant calls. Lower chimera scores are indicative of higher-quality DNA and library construction.
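The steps in the description (find supplementary alignments, rank contigs by frequency, keep the top 10) can be sketched as below. The final score formula is not given in the text, so the one used here, the share of all records accounted for by the top contigs, is purely a hypothetical stand-in.

```python
from collections import Counter

FLAG_SUPPLEMENTARY = 0x800  # supplementary alignment bit from the SAM specification


def top_chimeric_contigs(records, top_n=10):
    """Rank contigs by number of supplementary alignments, most frequent first.

    records: iterable of (flag, reference_name) tuples.
    """
    counts = Counter(ref for flag, ref in records if flag & FLAG_SUPPLEMENTARY)
    return counts.most_common(top_n)


def chimera_score(records, top_n=10):
    """Hypothetical score: supplementary alignments on the top contigs
    divided by the total number of records."""
    records = list(records)
    if not records:
        raise ValueError("no records supplied")
    top = top_chimeric_contigs(records, top_n)
    return sum(count for _, count in top) / len(records)


example_records = [
    (99, "chr1"), (2147, "chr1"), (2147, "chr1"), (2064, "chr2"), (147, "chr2"),
]
example_top = top_chimeric_contigs(example_records)
example_chimera_score = chimera_score(example_records)
```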
III. Genomic Context & Specificity Metrics
These metrics leverage the alignment information to assess specific genomic features or potential issues.
16. Telomere Integrity (Average Depth):
- Description: The average sequencing depth specifically over telomeric regions of the chromosomes.
- Rationale: Telomeres are repetitive regions at chromosome ends that are often challenging to sequence and map accurately. Consistent and sufficient coverage over telomeres indicates good overall library quality and even sequencing across difficult-to-map regions. Low coverage here might indicate coverage problems in other repetitive regions as well.
17. STR Locus QC (Average Depth & Score):
- Description: The average sequencing depth across a set of Short Tandem Repeat (STR) loci, and a calculated quality score for these regions.
- Rationale: STRs are highly polymorphic and can be difficult to sequence accurately due to their repetitive nature. Assessing coverage and quality at these loci provides insight into the library's ability to handle challenging regions, which is important for studies involving genotyping or forensics.
18. Chromosome Arm Coverage:
- Description: Measures the proportion of mapped reads for each chromosome relative to the total number of reads. The pipeline calculates the number of mapped reads per chromosome, divides this by the total read count to obtain a ratio, and assigns a score based on predefined thresholds where lower ratios receive higher scores.
- Rationale: Significant deviations in the proportion of reads mapping to specific chromosomes can indicate large-scale chromosomal abnormalities (e.g., aneuploidy, large deletions/duplications) or biases during sample processing. Consistent read distribution across chromosomes is crucial for copy number variation (CNV) analysis and detecting chromosomal imbalances.
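The ratio step in the description (mapped reads per chromosome divided by the total read count) can be sketched as follows. The scoring thresholds are not specified in the text, so this example stops at the ratios; the function name and input layout are assumptions.

```python
def chromosome_read_ratios(mapped_per_chromosome, total_reads):
    """Fraction of all reads that map to each chromosome.

    mapped_per_chromosome: dict of chromosome name -> mapped read count.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {
        chrom: count / total_reads
        for chrom, count in mapped_per_chromosome.items()
    }


example_counts = {"chr1": 80, "chr2": 70, "chrX": 40, "chrM": 10}
example_ratios = chromosome_read_ratios(example_counts, total_reads=200)
```

Per-chromosome mapped counts of this shape are what `samtools idxstats` reports for an indexed BAM. Because chromosomes differ in length, raw ratios are only comparable against an expected value for the same chromosome, not against each other.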
19. Stability Locus QC:
- Description: A metric providing an overall quality score based on coverage or other features at specific "stability" loci, which are often chosen for their known consistent behavior in sequencing.
- Rationale: These loci serve as internal controls. Their consistent and high-quality coverage can act as a benchmark for the overall sequencing run, indicating the reliability of the sequencing process.
20. mtDNA QC:
- Description: The percentage of reads that map to mitochondrial DNA (mtDNA) and the average coverage depth of the mitochondrial genome.
- Rationale: For nuclear WGS, a very low percentage of mtDNA reads is generally desired, as high levels can indicate contamination (e.g., from cell-free DNA in plasma) or a sample with a high mitochondrial copy number relative to nuclear DNA. The acceptable range varies by sample type (e.g., plasma, whole blood, tissue). The pipeline penalizes very high percentages, indicating a potential issue for nuclear WGS.
Conclusion
Implementing these standardized metadata and QC metrics for WGS data lets the Unobio search engine quickly analyze and filter relevant genomics data. By evaluating these parameters, it can assess data quality with confidence, identify potential technical issues early, and ensure the reliability of genomic data.
We want these metrics to be community defined. If you or anyone you know thinks there are better ways to categorize and curate WGS data, please reach out and we will be happy to update our pipelines. This search engine is a community feature, and we would like all researchers to be able to use it.


