Complete Human Genome Sequence is in, and the Missing Pieces Are...

April 1, 2022

As the globe recovers from the COVID-19 pandemic, an unlikely tool has been handed to geneticists, epidemiologists and researchers hoping to develop the next marketable prophylactic. The Telomere-to-Telomere (T2T) Consortium has completed sequencing of the entire human genome, referring to the assembly as T2T-CHM13. About 92% of the genome has been sequenced since 2003. The final 8%, now complete, is not confined to a single chromosome: it is spread across the genome in hard-to-sequence regions such as the centromeres and the short arms of the acrocentric chromosomes, regions now thought to matter for immunity, adaptation and survival.

In total, the T2T-CHM13 assembly comprises 3,054,815,472 base pairs of nuclear DNA (about 3.055 billion) across twenty-two autosomes and an X chromosome, plus 16,569 base pairs of mitochondrial DNA. The full data set has been shared with the scientific community and the public.

The full details of T2T-CHM13 were published Thursday in the journal Science, which is published by the American Association for the Advancement of Science (AAAS). Cross-analysis using industry-standard sequencing methods, such as 100X Illumina PCR-Free sequencing and 120X Oxford Nanopore ultralong-read sequencing, uncovered keys to understanding human biology. To assemble the data, researchers used PacBio HiFi reads to construct a high-resolution assembly string graph, a structure that records how the reads overlap one another. This allowed for visual analysis of overlaps, error correction and compression.
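To make the idea of read overlaps concrete, here is a minimal sketch of a toy overlap graph, the simpler structure from which a string graph is derived. It is not the consortium's pipeline: the reads are invented, the function names `overlap_length` and `build_overlap_graph` are hypothetical, and real assemblers also handle sequencing errors, reverse complements and redundant edges, all of which this sketch ignores.

```python
from itertools import permutations


def overlap_length(left: str, right: str, min_len: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def build_overlap_graph(reads: list[str], min_len: int = 3) -> dict[tuple[str, str], int]:
    """Map each ordered read pair (a, b) to its overlap length, if any."""
    graph = {}
    for a, b in permutations(reads, 2):
        size = overlap_length(a, b, min_len)
        if size:
            graph[(a, b)] = size
    return graph


# Three invented reads drawn from the invented sequence "ATGCGTACGTTAG".
toy_reads = ["ATGCGTAC", "GTACGTT", "CGTTAG"]
toy_graph = build_overlap_graph(toy_reads)

for (a, b), size in toy_graph.items():
    print(f"{a} -> {b} (overlap {size})")
```

Following the two edges in order and merging each overlap rebuilds the original toy sequence, which is the intuition behind assembling a genome from overlapping reads.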

Complex areas of the genome required additional analysis beyond a string graph. These include the five rDNA arrays, located on the short arms of the acrocentric chromosomes. Each array is a run of tandem repeats of a unit about 45,000 base pairs long, and the number of copies varies from person to person.

To assemble these regions, researchers built a separate sparse de Bruijn graph for each array. The sequence was further edited using a “targeted validation of long tandem repeats to identify any errors missed by the genome-wide approach.” Even with thorough analysis, it is estimated that one error remains in every 10 megabase pairs, equivalent to one in ten million base pairs.
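For readers unfamiliar with de Bruijn graphs, the sketch below builds a plain (not sparse) one from two invented reads. It is an illustration under stated assumptions, not the method used for T2T-CHM13: the function name `build_de_bruijn`, the reads and the choice of k = 4 are all hypothetical, and a sparse de Bruijn graph would store only a subset of the k-mers to save memory.

```python
from collections import defaultdict


def build_de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Return edges between (k-1)-mers; each k-mer in a read adds one edge."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's first k-1 bases to its last k-1 bases.
            edges[kmer[:-1]].add(kmer[1:])
    return dict(edges)


# Two invented reads that share the 4-mer "GTAC".
toy_reads = ["ACGTAC", "GTACGG"]
toy_edges = build_de_bruijn(toy_reads, k=4)

for node in sorted(toy_edges):
    print(node, "->", sorted(toy_edges[node]))
```

In this toy graph the node ACG has two outgoing edges and sits on a cycle (ACG, CGT, GTA, TAC, back to ACG). Repeated sequence produces exactly this kind of branching and looping, which is why tandem repeats such as the rDNA arrays are hard to resolve.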

The article describes how the data reveal the history of human biology, such as how “local ancestry analysis shows that most of the CHM13 genome is of European origin, including regions of Neanderthal introgression.” Looking ahead, the article says that this work will “drive future discovery in human genomic health and disease.”

Over twenty years ago, in 2000, the Human Genome Project kickstarted a biotechnology renaissance that led to gene therapies, personalized medicine and research into how to prevent diseases that have a contributing genetic component. The current reference genome was released by the Genome Reference Consortium (GRC) in 2013, with a few gaps filled later in 2019. That reference still leaves 151 megabase pairs of sequence unknown. It is a roughly 92% complete human genome, littered with gaps in important areas such as the short arms of the five acrocentric chromosomes.

The task of completing the work started by the Human Genome Project and the GRC was no small feat. Until now, technology had not been advanced enough to expose the remaining 8% of the genome. Earlier pieces of the puzzle were found using genetic linkage analysis, sequencing of bacterial artificial chromosomes (BACs) and fingerprint maps.

Jazmine Colatriano, M.S.