VARIANT FILTERING AND RANKING PIPELINE
By Vasily Ramenskiy, ICNN
The variant annotation pipeline includes multiple independent components:
1) Genome sequence markup. Coding exons, UTRs, splice sites, upstream promoter and downstream regions, etc. marked up for each gene transcript (isoform) in UCSC knownGene and/or GENCODE gene sets. This can characterize location of each variant respective to a closest gene transcript, for example, overlap with splice site, upstream promoter region. This component also includes genome markup by RepeatMasker software (low complexity regions, tandem repeats, ALU repeats, etc.)
2) Single nucleotide variant annotation databases. Information on variant allele population frequencies: dbSNP 138 and Exome Variant Server v.0.0.21 databases.
3) Predictions for coding missense variants. This component includes MapSNPs/PolyPhen-2 software suite and additional precomputed functional predictions for all coding variants by SIFT, LRT, and MutationTaster methods available from dbNSFP v.2.0 database.
4) CADD v.1.0 database. Combined Annotation Dependent Depletion (CADD) is a framework that integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations (Kircher et al, 2014). The database contains predicted pathogenicity scores calculated for all possible ~8.6 bln variants in the human genome.
5) Known disease and phenotype associations. HGMDPro 2013 (Human Gene Mutation Database), UCSC GWAS catalog (~17,000 SNPs), publicly available eQTL resources: eQTL resources at the Pritchard Lab (216,768 eQTLs from 17 studies); NCBI GTEX, Aug 2013: (40,556 eQTLs in 7 tissues); Harvard School of Public Health eQTL databases (219,706 eQTLs in lymphoblastoid cells).
6) Comparative genomics resources. Estimates of evolutionary constraint or acceleration acting at individual nucleotide or element level in primate, mammalian (Lindblad-Toh, 2011) and vertebrate lineages, measured with Genomic Evolutionary Rate Profiling (GERP), PhyloP and PhastCons methods, downloaded from UCSC Genome browser database. The datasets reported in (Lindblad-Toh, 2011) comprise a high-resolution map of evolutionary constraint in the human genome based on 29 eutherian mammals. This dataset includes lists of constrained sequence elements and measures of constraint at the nucleotide level.
7) Predicted binding sites for human transcription factors (TFBS). This component is based on the HOCOMOCO database which contains high quality manually curated position weight matrices (PWM) for TFBS models of ~400 human transcription factors. Currently, HOCOMOCO is two times larger than the similar open access resource JASPAR, with model performance superseding both JASPAR and TRANSFAC in terms of TFBS recognition accuracy (Kulakovskiy, 2012). There are ~59 mln of TFBS motifs for 373 factors mapped to the hg19 sequence. For every predicted TFBS motif in the genome, there are listed disruptive single nucleotide substitutions, based on the compatibility of altered sequence with corresponding factor position weight matrix.
8) Chromatin states markup, based on ChromImpute 25-state segmentation.
Unlike some existing methods (Ward, 2011) the suggested analysis approach is not restricted to common variation and naturally incorporates both common, known and novel SNVs and indels resulting from sequencing analysis.
Using the variant annotation pipeline, variants in the specific regions of interest suggested as explained in Aim 2 can be assigned to one of the following tiers:
Tier 1. Disruptive variants in coding exons: nonsense (stop-gain), splice site variants and short indels. These variants are easily detected and are believed to have the largest impact on the encoded proteins.
Tier 2. Damaging coding variants: missense coding variants predicted to be damaging by PolyPhen-2. The last decade has seen the active development of computational methods for prediction of functional effect of coding missense variants based on evolutionary and structural considerations, with accuracy of such methods currently reaching 75–80%. However, coding variants comprise the negligible fraction of the total human genome variation and cannot fully explain the phenotypic variation attributed to genetic factors, thus emphasizing the need for approaches that will incorporate non-coding variation into analysis.
Tiers 3,4. Variants conserved or accelerated in more than one lineage (mammalian, primate, vertebrate) or in one lineage only. There is a growing body of evidence indicating that a significant fraction of phenotypically important DNA elements are located in the non-coding regions of the human genome. The annotation components of our pipeline indicate whether a variant overlaps with human or primate accelerated region or, conversely, falls into primate or mammalian highly conserved sequence region or site.
Tiers 1 and 2 contain coding variants, with tiers 3 and 4 being mostly non-coding. However, a missense variant may fall into 3 or 4 if it is conserved at nucleotide level but predicted as benign at the protein level.
Variants can be ranked within each tier, if needed, based on additional lines of evidence: for example, a variant is located in a gene region known to be involved in gene regulation (upstream promoter region, first intron); overlaps with known transcription factor binding site (TFBS) motif and disrupts it. It is worth noting that each feature considered alone has limited predictive power. For example, conservation is a poor predictor of functional regulatory variation itself, whereas it can accurately trace a protein–DNA binding event when considered in combination with nucleotide-level DNA accessibility (Stamatoyannopoulos, 2012).
Finally, four major independent sources of information can be used for variant filtering and ranking:
1) Phenotype information and segregation patterns in affected and non-affected sequenced individuals.
For example, variants seen in affected individuals in more than one pedigree.
2) Variant allele frequencies in married-in sequenced individuals and dbSNP.
3) Overlap with regulatory active or transcribed chromatin regions, based on ChromImpute 25-state segmentation.
4) Potentially functional variants grouped in tiers T1-T4 based on bioinformatic annotations
Overlapping these sources of information will generate sets of potentially most functional coding and non-coding variants in the region of interest.
Stamatoyannopoulos, J. A. What does our genome encode? Genome Res. 22, 1602–1611 (2012).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet advance online publication, (2014).
Lindblad-Toh, K. et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011).
Kulakovskiy, I. V. et al. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res 41, D195–D202 (2013).
Ward, L. D. & Kellis, M. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucl. Acids Res. (2011). doi:10.1093/nar/gkr917