Variation graphs facilitate genomic discovery
Katharine Miller | Inquiry UCSC
After the first human genome was successfully sequenced in 2003, researchers established it as the reference genome. It became the singular, highest-quality, most well-understood, standardized genome against which all other human genomes would be mapped and compared for the foreseeable future.
It turns out this commitment to a single reference genome has a big downside. Called reference bias or mapping bias, it can cause potentially important observations to be misinterpreted or rejected when they don’t fit the expected pattern. “With the existing reference genome,” said Benedict Paten, UC Santa Cruz assistant professor of biomolecular engineering, “it’s easier to find variants that are in the reference genome than ones that aren’t.”
The problem is particularly acute for structural variations in the genome: long stretches of DNA that differ from the reference in various ways, including changes known as insertions, deletions, inversions, and translocations. When these interesting and potentially important variants exist in a new sample that is being mapped against the reference, they might not be seen at all. As a consequence, the new sample is deemed more similar to the reference than it actually is. The failure to find important variants can have serious consequences for patients if a missed variant is the cause of a genetic disorder or plays a key role in a patient’s cancer. And as the pace of sequencing-based genomic research continues to increase, so too do the potential impacts of reference bias.
To address this critical and growing concern, Paten and colleagues at UCSC and the Sanger Institute set out to build a set of technologies for replacing the existing reference genome with a more comprehensive foundational structure. “The natural thing is to have a graph that includes all the known gene variants,” Paten said. And though the idea of replacing the linear reference genome with a graph might sound simple, he said, “we’ve had to solve some pretty tricky computer science problems to make this work.”
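To make the idea of a graph that includes known variants concrete, here is a minimal toy sketch in Python. It is not the data structure or software built by Paten's group; the class name `VariationGraph`, the helper `build_toy_graph`, and the DNA sequences are all hypothetical and chosen only for illustration. Each node holds a short stretch of DNA, and each walk through the graph spells out one possible version of the genome.

```python
from collections import defaultdict


class VariationGraph:
    """Toy variation graph (hypothetical, for illustration only)."""

    def __init__(self):
        self.seq = {}                    # node id -> DNA sequence
        self.edges = defaultdict(list)   # node id -> list of successor ids

    def add_node(self, node_id, sequence):
        self.seq[node_id] = sequence

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def spell(self, path):
        """Concatenate the sequences of the nodes along a path."""
        return "".join(self.seq[n] for n in path)

    def haplotypes(self, start, end):
        """Yield the sequence of every walk from start to end.

        Assumes the graph is acyclic, which holds for this toy example.
        """
        stack = [[start]]
        while stack:
            path = stack.pop()
            if path[-1] == end:
                yield self.spell(path)
                continue
            for nxt in self.edges[path[-1]]:
                stack.append(path + [nxt])


def build_toy_graph():
    """Build a small graph with one single-base variant and one insertion."""
    g = VariationGraph()
    g.add_node(1, "ACGT")
    g.add_node(2, "G")      # reference allele
    g.add_node(3, "T")      # alternate allele at the same position
    g.add_node(4, "CCA")
    g.add_node(5, "TTTT")   # insertion that is absent from the reference
    g.add_node(6, "GA")
    for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]:
        g.add_edge(src, dst)
    return g


if __name__ == "__main__":
    graph = build_toy_graph()
    reference = graph.spell([1, 2, 4, 6])   # the single linear reference path
    read = "CATTTTG"                        # a read that spans the insertion
    print("read matches linear reference:", read in reference)
    print("read matches some graph path:",
          any(read in h for h in graph.haplotypes(1, 6)))
```

In this sketch, a read that spans the insertion has no exact match on the linear reference path but does match a walk through the graph. That is a simplified picture of how a graph can reduce reference bias. Real read mappers tolerate mismatches and use indexes rather than enumerating every walk.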
Having a great solution doesn’t mean it will be used, especially given the deep entrenchment of the single reference genome in the field. Nevertheless, even members of the Genome Reference Consortium who’ve spent their careers maintaining and improving upon the linear reference agree that genome variation graphs make sense. “The representation of variation in the human population is a type of data that fits very neatly into and is well-represented by a graph model,” said Valerie Schneider, program head for sequence displays and tools at the NIH’s National Center for Biotechnology Information (NCBI) and team lead for the NCBI’s involvement in the Genome Reference Consortium. “Benedict is at the leading edge of where this is going.”
Tina Graves-Lindsay, leader of the reference genomes group at Washington University’s McDonnell Genome Institute in St. Louis, agreed: “When we get to the point where they’re ready to be used, genome graphs are the future representation.”