A complete and accurate genome sequence forms the basis of all downstream genomic analyses. However, even the human reference genome remains incomplete, which affects the quality of experiments and can mask true genomic variation. Long-read sequencing technologies, such as Oxford Nanopore, have begun to correct this deficiency and are enabling the automated reconstruction of reference-quality genomes at relatively low cost. In a collaborative effort, we sequenced the NA12878 human genome and assembled it using Canu. The sequencing set includes 5-fold coverage of ‘ultra-long’ reads with an N50 of >100 kbp and a maximum length of >850 kbp. Despite the low coverage, the assembly contiguity (NG50 ~6.4 Mbp) exceeds that of similar-coverage assemblies using other long-read technologies. Additionally, we model expected assembly contiguity and predict that 30-fold coverage of ultra-long reads can exceed a 40 Mbp NG50 and match the contiguity of the current reference. We further use a new approach, “trio binning”, which relies on short-read sequencing of the parents to partition the offspring's long reads by haplotype, so that both haplotypes of a diploid offspring can be assembled. Application to NA12878 reveals structural variants between haplotypes that were missed by traditional phasing approaches. Combining these technologies with complementary scaffolding approaches such as chromatin conformation capture (Hi-C) may soon enable the complete reconstruction of vertebrate haplotypes.
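For readers unfamiliar with the contiguity metrics quoted above, here is a minimal Python sketch of the standard N50 and NG50 definitions. It is illustrative only and is not the code used in this work; the function names `n50` and `ng50` and the toy lengths are assumptions made for the example. N50 is the length L such that sequences of length at least L make up half of the total assembled (or sequenced) bases, while NG50 uses half of an assumed genome size instead, which makes assemblies of different total size comparable.

```python
from typing import Iterable, Optional


def _half_coverage_length(lengths: Iterable[int], total: Optional[int] = None) -> int:
    """Return the length L at which sequences of length >= L reach half of `total`.

    If `total` is None, the sum of the input lengths is used (N50).
    Returns 0 when the sequences never reach half of `total`.
    """
    ordered = sorted(lengths, reverse=True)
    if total is None:
        total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if 2 * running >= total:
            return length
    return 0


def n50(lengths: Iterable[int]) -> int:
    """N50: half of the summed input lengths is the target."""
    return _half_coverage_length(lengths)


def ng50(lengths: Iterable[int], genome_size: int) -> int:
    """NG50: half of an assumed genome size is the target."""
    return _half_coverage_length(lengths, genome_size)


if __name__ == "__main__":
    # Toy contig lengths (arbitrary units), not real data.
    contigs = [80, 70, 50, 40, 30, 20]
    print("N50:", n50(contigs))
    print("NG50 (genome size 400):", ng50(contigs, 400))
```

Because NG50 is measured against the genome size rather than the assembly size, an assembly that is missing sequence will report an NG50 no larger than its N50, which is why NG50 is the fairer figure when comparing assemblies of the same genome.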
Sergey received his PhD in computer science in 2012 under the supervision of Mihai Pop at the University of Maryland. He joined the National Bioforensics Analysis Center in 2011 and was appointed as an associate principal investigator in 2014. During this time, he pioneered the use of single-molecule sequencing for the reconstruction of complete genomes. In 2015, he joined the National Human Genome Research Institute as a founding member of the Genome Informatics Section. His research focuses on the efficient analysis of large-scale genomic datasets and new methods for metagenomic analysis and assembly of high-noise single-molecule sequencing data.