New Long Read Assemblers for De Novo Genomes Promise Speed, Scalability

Figure 1: Nanopore sequencing results.

(a) Throughput in gigabases from each of three flowcells for eleven samples, with total throughput at top. (b) Read N50s for each flowcell. (c) Alignment identities against GRCh38. Medians in a, b and c are shown by dashed lines; the dotted line in c is the mode. (d) Genome coverage as a function of read length. Dashed lines indicate coverage at 10 and 100 Kb. HG00733 is bolded as an example. (e) Alignment identity for standard and run-length encoded (RLE) reads. Data for HG00733 chromosome 1 are shown. Dashed lines denote quartiles.

GenomeWeb | Andrew P. Han | Aug 16, 2019

NEW YORK – Newly released algorithms can assemble de novo human genomes from long read sequencing data in just a few hours’ time.

Shasta, an algorithm built around in-memory computing that was developed by researchers at the Chan Zuckerberg Initiative (CZI) and tested by researchers from the University of California, Santa Cruz, can complete a de novo human genome assembly in under six hours, the authors wrote, for an average cost of $70 per sample.

Using reads generated by the Oxford Nanopore Technologies PromethION sequencing instrument, the researchers were able to create “near chromosome-level” scaffolds for eleven genomes. While Shasta produced less contiguous assemblies (contig N50s between 19.3 and 37.8 megabases) than some other long read assemblers, it also produced fewer misassemblies, the authors wrote. They posted their study to bioRxiv on July 26.

And earlier in July, two Pacific Biosciences veterans, now working independently, described Peregrine, an assembler that uses an indexing scheme to assemble reads that meet certain accuracy and length requirements. Using previously generated datasets of PacBio long reads, the authors reported that they were able to assemble a genome with 30x coverage in 100 minutes of wall clock time. The N50 was greater than 20 megabases. They also posted a preprint to bioRxiv.
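
For readers unfamiliar with the N50 figures quoted for both assemblers, the metric is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled bases. The Python sketch below shows the standard calculation only; the function name `n50` and the example lengths are illustrative assumptions and are not taken from either preprint or from either tool's code.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Illustrative helper (hypothetical name), not part of Shasta or Peregrine.
    """
    if not lengths:
        raise ValueError("need at least one length")
    # Walk from the longest sequence down, accumulating bases.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length


# Made-up contig lengths in megabases, for illustration only.
example_contigs_mb = [40, 30, 20, 10]
print(n50(example_contigs_mb))  # prints 30
```

In this toy case the total is 100 megabases, and the two longest contigs (40 and 30) are the first to reach the 50-megabase halfway mark, so the N50 is 30. A higher N50 generally indicates a more contiguous assembly, though it says nothing about whether the contigs are correctly assembled.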

Developers for both algorithms said they hoped their assemblers could increase the pace of genomic research and help researchers find new structural variants.

“Shasta and other tools are cheap and quick, designed with the intent to be on the cloud,” said Benedict Paten, a computational geneticist at UC Santa Cruz and an author of the Shasta preprint. “They really give us the power to scale out nanopore sequencing. We’re easily talking about assembling hundreds of de novo genomes in the next couple years.”
