Whole Exome Sequencing (WES) is a powerful clinical diagnostic tool for discovering the genetic basis of many diseases. WES concentrates sequencing depth on the target regions, which gives a high probability of detecting variants in the protein-coding regions of the genome. Compared to Whole Genome Sequencing (WGS), WES offers distinct advantages in speed, efficiency and cost-effectiveness. A major shortcoming of WES is the uneven coverage of sequence reads over the exome targets, which produces many low-coverage regions and hinders accurate variant calling. Inaccurate variant calling impairs the detection of sequence changes and may contribute to the missing heritability of genetic disorders. We sought to delineate the factors that affect coverage by assessing sequencing data from a total of 176 samples generated on different WES platforms.
We devised two novel metrics, the Unevenness (UE) score and the Cohort Coverage Sparseness (CCS) score, to assess the distribution of sequence read coverage over exome datasets. Specifically, the UE score measures the non-uniformity of coverage, and the CCS score measures the percentage of base pairs with low coverage in a specific genomic region. Using these metrics, we revealed both local (coverage of a given exon) and global (coverage of all exons across the genome) non-uniformity of coverage in exome sequencing data. We also found that low-coverage regions do not occur at random: they were often associated with high GC content, repetitive sequences and segmental duplications, and they encompassed functionally relevant genes.
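To make the idea behind the CCS score concrete, here is a minimal Python sketch of a CCS-like calculation. It is not the ExomeCQA implementation: the function name `ccs_like_score`, the depth threshold of 10 reads, and the choice to call a base "low coverage" when the cohort median depth falls below that threshold are all assumptions made for illustration. The exact definition is the one given in the manuscript.

```python
from statistics import median


def ccs_like_score(depths_by_sample, low_cov_threshold=10):
    """Return the percentage of positions in a region with low cohort coverage.

    depths_by_sample: one list of per-base read depths per sample, all
    covering the same region (hypothetical input layout).
    low_cov_threshold: assumed cutoff; a position counts as low coverage
    when the cohort median depth is below it.
    """
    if not depths_by_sample:
        raise ValueError("at least one sample is required")
    n_positions = len(depths_by_sample[0])
    if n_positions == 0:
        raise ValueError("the region must contain at least one position")
    if any(len(row) != n_positions for row in depths_by_sample):
        raise ValueError("all samples must cover the same positions")

    # zip(*rows) walks the region position by position across the cohort.
    low = sum(
        1
        for column in zip(*depths_by_sample)
        if median(column) < low_cov_threshold
    )
    return 100.0 * low / n_positions


if __name__ == "__main__":
    cohort = [
        [30, 5, 0, 40],   # sample 1
        [28, 8, 2, 12],   # sample 2
        [35, 4, 1, 9],    # sample 3
    ]
    print(ccs_like_score(cohort))  # 50.0
```

In this toy cohort the median depths per position are 30, 5, 1 and 12, so two of the four positions fall below the assumed threshold.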
ExomeCQA automatically calculates the CCS and UE scores for any cohort-based WES dataset. Its main functions are to:
- Calculate CCS scores of each gene for a specific chromosome, or for all chromosomes
- Calculate UE scores for each exon within a specific region, or for all chromosomes
- Report the features (position, width and height of peaks) of the coverage distribution in each exon
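As an illustration of the last item, the sketch below extracts peak features from a single exon's per-base depth profile. It is only one plausible reading of "position, width and height": the function name `peak_features`, the use of 0-based offsets within the exon, and the definition of width as the span of contiguous bases at or above half the peak height are assumptions, not the method ExomeCQA is known to use.

```python
def peak_features(depths):
    """Describe the local maxima of a per-base depth profile.

    Returns a list of dicts with the assumed fields:
      position: 0-based offset of the peak (centre of a plateau)
      height:   depth at the peak
      width:    number of contiguous bases around the peak whose depth
                is at least half the peak height
    A maximum at the edge of the profile is also reported as a peak.
    """
    peaks = []
    n = len(depths)
    i = 0
    while i < n:
        # Extend j over a plateau of equal depths starting at i.
        j = i
        while j + 1 < n and depths[j + 1] == depths[i]:
            j += 1
        rises_on_left = i == 0 or depths[i - 1] < depths[i]
        falls_on_right = j == n - 1 or depths[j + 1] < depths[i]
        if rises_on_left and falls_on_right and depths[i] > 0:
            half = depths[i] / 2
            lo, hi = i, j
            while lo > 0 and depths[lo - 1] >= half:
                lo -= 1
            while hi < n - 1 and depths[hi + 1] >= half:
                hi += 1
            peaks.append(
                {
                    "position": (i + j) // 2,
                    "height": depths[i],
                    "width": hi - lo + 1,
                }
            )
        i = j + 1
    return peaks


if __name__ == "__main__":
    exon_depths = [2, 5, 12, 20, 12, 5, 2, 3, 8, 3, 1]
    for peak in peak_features(exon_depths):
        print(peak)
```

For the toy profile above, the sketch reports one peak of height 20 at offset 3 spanning 3 bases and a smaller peak of height 8 at offset 8 spanning 1 base.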
Citation: Qingyu Wang, Shashikant Cooduvalli, Naomi Altman, and Santhosh Girirajan. "Novel metrics to measure coverage in whole exome sequencing datasets reveal local and global non-uniformity," manuscript in preparation.