A genome graph for the Leukocyte Receptor Complex built from 30 underlying human haplotypes
Since UC Santa Cruz researchers posted the first draft of the human genome on the Internet in 2000, the working model of human genetics in the genome era has not changed: genomics relies on a single, haploid set of 24 reference chromosomes to interpret human genomes. Essentially all new sequencing data is analyzed by mapping the sequenced reads to this one reference set of 24 chromosomes to identify variants. This leads to a phenomenon called reference allele bias: the tendency to over-report versions (“alleles”) of the genome that are in the reference genome and to miss non-reference alleles. In fact, some variants simply cannot be described with respect to the reference genome; these constitute the so-called dark matter of the genome. Not only is this suboptimal, it is biased by genetic subpopulation: the current reference is a better reference for some subpopulations than for others. It is not a universal reference for humanity.
The Human Genome Variation Map (HGVM) is an enormously ambitious project that will create the first standard, comprehensive taxonomy for human variation and in the process transform genetics. Instead of describing genetic variations with respect to a changing, linear coordinate system (the current reference genome), it will add this missing variation to the reference, resulting in a structure that can be described as a mathematical graph: a genome graph. It will give each common variation a standard name that can be preserved in perpetuity, even as new variations are added. By standardizing the identification of variations, it will avoid the current free-for-all, in which it is frequently unclear whether reported variations are distinct or equivalent. It will integrate complex structural variations with point variations, avoiding the current piecemeal fragmentation. Finally, it will substantially improve inference methods, reducing reference allele bias by creating methods that integrate all common variation, not just the variation present in the reference.
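To make the idea of a genome graph concrete, here is a minimal Python sketch. It is an illustration only and not the HGVM data model: the `GenomeGraph` class, its methods, the node IDs and the toy sequences are all hypothetical. It shows how variation can be added as new nodes and edges next to the reference path, while the identifiers already assigned stay unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class GenomeGraph:
    """Toy sequence graph (hypothetical, for illustration only)."""
    nodes: dict = field(default_factory=dict)   # stable node ID -> sequence
    edges: set = field(default_factory=set)     # (from_id, to_id) pairs
    paths: dict = field(default_factory=dict)   # haplotype name -> list of node IDs

    def add_node(self, node_id, sequence):
        # IDs are never reused or renumbered, so existing names stay valid.
        if node_id in self.nodes:
            raise ValueError(f"node ID {node_id!r} is already in use")
        self.nodes[node_id] = sequence

    def add_path(self, name, node_ids):
        # Record a haplotype as a walk and add the edges it traverses.
        for node_id in node_ids:
            if node_id not in self.nodes:
                raise KeyError(f"unknown node {node_id!r}")
        self.edges.update(zip(node_ids, node_ids[1:]))
        self.paths[name] = list(node_ids)

    def sequence(self, name):
        # Spell out a haplotype by concatenating node sequences along its path.
        return "".join(self.nodes[n] for n in self.paths[name])


graph = GenomeGraph()
graph.add_node("n1", "GATT")
graph.add_node("n2", "A")      # reference allele at the variable site
graph.add_node("n3", "CAT")
graph.add_path("reference", ["n1", "n2", "n3"])

# A point variation becomes one extra node forming a "bubble" beside n2.
graph.add_node("n4", "G")
graph.add_path("haplotype_a", ["n1", "n4", "n3"])

# A deletion of the variable site adds only an edge; n1..n4 keep their names.
graph.add_path("haplotype_b", ["n1", "n3"])

for name in graph.paths:
    print(name, graph.sequence(name))
```

In this sketch the reference is just one path among several, so a read carrying the `G` allele or the deletion has a path to match against instead of being forced onto the reference sequence. A real genome graph would also need to handle both DNA strands, coordinates along paths, and far larger scale, all of which are omitted here.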
The project is a collaboration between the Computational Genomics Lab and members of the reference variation working group of the Global Alliance for Genomics and Health, and builds on many great community efforts that share its goals. It is supported by grants from the Simons Foundation, the Keck Foundation, Agilent Technologies and the National Institutes of Health.