Advancement: Exploring the use of long read technology for haplotype phasing and allele specific epigenetic profiling

Speaker Name: Trevor Pesout
Speaker Title: PhD Student (Advisor: Benedict Paten)
Speaker Organization: Biomolecular Engineering & Bioinformatics
Abstract: Third-generation long-read sequencing has enabled scientists to assemble high-quality genomes and determine large structural variation in ways that second-generation short reads cannot. As most analysis techniques are designed to accept short but highly accurate read information, new technologies must be developed to exploit the benefits of longer reads while mitigating their error-prone nature. I present work on MarginPhase, a long-read variant caller which uses the diploid nature of the human genome to differentiate heterozygous sites from read errors. A Hidden Markov Model is used to bipartition reads and simultaneously genotype and haplotype the sample. Next, I describe planned work to use and improve on the SignalAlign technology, which performs a banded alignment of ONT event data to a reference and uses a Hierarchical Dirichlet Process model to detect methylation. We expect that incorporating methylation into the MarginPhase model will enable patterns of allele-specific methylation to be recognized in the read data; in the same way that reads spanning heterozygous sites allow a partitioning of the reads into haplotypes, this should improve the bipartitioning and overall results. I plan to initially demonstrate this improvement by using it to fully assemble the two ChrX haplotypes in the NA12878 sample. Finally, I propose an engineering project wherein an end-to-end genomic analysis pipeline using the described technologies is developed for custom hardware. The designed pipeline takes raw data from ONT sequencers as input and performs basecalling, methylation detection, alignment, and variant calling, while managing data partitioning and the export of specific operations to processors local to the data.
Last modified: Aug 23, 2018