On 14 April 2003, the International Human Genome Sequencing Consortium announced the completion of the Human Genome Project through the National Human Genome Research Institute and the US Department of Energy. The announcement was timed to the 50th anniversary of Watson and Crick’s double helix paper. The reference sequence released that day covered, by the NHGRI’s own later accounting, about 92% of the human genome. The remaining 8% was technically out of reach. It stayed that way for the next nineteen years.
The missing fraction was not random leftover material. It was concentrated in the parts of the genome that were hardest to sequence with the technology of the early 2000s: centromeres, which coordinate the separation of chromosomes during cell division; telomeres, the repetitive ends of chromosomes; the short arms of the five acrocentric chromosomes; and long stretches of segmental duplication, where almost-identical sequences appear in multiple places and short-read sequencers cannot tell one copy from another. These regions were not skipped because they were unimportant. They were skipped because the methods available could not reliably read them.
What the original project actually produced
It is worth being precise about what was finished in 2003 and what was not. According to the NHGRI’s Human Genome Project fact sheet, the 2003 reference accounted for about 92% of the genome and had fewer than 400 gaps. It was billed at the time as essentially complete because it covered roughly 99% of the gene-containing, euchromatic portion of the genome, the part where most protein-coding genes live. For the questions the project was designed to answer, that was a reasonable place to declare the work done.
The framing held up better in some respects than in others. The genes most laboratories cared about, the ones implicated in known diseases, were largely in the finished section. The repetitive heterochromatin around centromeres and the rDNA-rich short arms of acrocentric chromosomes were not, and they could not be addressed by simply running the existing pipelines harder. The remaining 8% would need different sequencing chemistry, different assembly algorithms, and the kind of effort that does not attract the public attention of a presidential announcement.
The 2022 announcement, and what it actually delivered
On 31 March 2022, the Telomere-to-Telomere Consortium released the first complete, gapless human genome sequence. The lead paper, “The complete sequence of a human genome” by Sergey Nurk and colleagues, was published in Science alongside five companion papers covering segmental duplications, centromeres, epigenetics, repeat elements, and the consequences for studying human genetic variation. The assembly was designated T2T-CHM13. It is 3.055 billion base pairs long, includes gapless assemblies for all chromosomes except Y, and adds nearly 200 million base pairs of sequence that had never been read before.
The newly added sequence contains 1,956 predicted genes, 99 of which are predicted to code for proteins. It includes, for the first time, the complete sequence of every centromere in the human genome, the short arms of all five acrocentric chromosomes, and a large portion of the segmental duplications that the earlier reference either collapsed or got wrong. As the paper itself puts it, the T2T-CHM13 assembly adds five full chromosome arms and more sequence than any reference release in the previous 20 years.
The consortium was led by Adam Phillippy at NHGRI, Karen Miga at the University of California, Santa Cruz, and Evan Eichler at the University of Washington. About 100 scientists worked on it. The bulk of the funding came from NHGRI. The technical breakthrough rested on two long-read sequencing platforms, Oxford Nanopore and PacBio HiFi, which between them produced reads long enough and accurate enough to bridge the repetitive sections that had defeated short-read methods.
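The difference read length makes can be illustrated with a toy example. The sketch below is not the consortium's method and uses no real sequencing data. It builds a short random sequence containing two identical copies of a repeat, then asks what fraction of error-free reads of a given length match the sequence in more than one place. Real segmental duplications are near-identical rather than identical, and real reads contain errors, so this is a deliberate simplification. The function names, sequence sizes, and read lengths are all invented for illustration.

```python
import random


def count_placements(genome: str, read: str) -> int:
    """Count the positions where `read` matches `genome` exactly."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        start = genome.find(read, start + 1)  # allow overlapping matches
    return count


def ambiguous_fraction(genome: str, read_length: int) -> float:
    """Fraction of all error-free reads of this length that match
    the genome at more than one position."""
    total = len(genome) - read_length + 1
    ambiguous = sum(
        1
        for i in range(total)
        if count_placements(genome, genome[i:i + read_length]) > 1
    )
    return ambiguous / total


# Hypothetical toy genome: unique flanks around two identical repeat copies.
rng = random.Random(0)


def random_dna(n: int) -> str:
    return "".join(rng.choice("ACGT") for _ in range(n))


repeat = random_dna(300)
genome = random_dna(200) + repeat + random_dna(200) + repeat + random_dna(200)

for read_length in (50, 400):
    fraction = ambiguous_fraction(genome, read_length)
    print(f"read length {read_length}: {fraction:.1%} of reads are ambiguous")
```

In this toy setting, reads shorter than the repeat often fall wholly inside it and cannot be assigned to one copy, while reads longer than the repeat carry unique flanking sequence that anchors them to a single position.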
Why the missing regions mattered
The phrase “8% of the genome” sounds like a footnote. It is not. The functional content of that 8% is easier to underplay than to overclaim.
Centromeres are the constriction points where the spindle apparatus attaches during cell division. A centromere that does not function properly leads to chromosomes that do not separate properly, which is one of the routes to the chromosomal abnormalities associated with cancer and developmental disorders. Until 2022, the centromeric sequence in the reference genome was largely placeholder, an acknowledgement that something repetitive lived there rather than a readable text. The T2T assembly is the first to provide the full sequence of every human centromere, so that centromeres can finally be compared across chromosomes and, as more genomes are completed, across individuals.
The segmental duplications matter for a different reason. According to the UC Santa Cruz announcement of the work, these long, near-identical blocks of DNA are known to play important roles in evolution and disease, and many of them sit in regions that the earlier reference either collapsed into a single copy or assigned to the wrong chromosome. Several of the gene families housed in these duplications are involved in immune function, including clusters relevant to natural killer cell receptors and the broader major histocompatibility region. Saying these regions matter for “immunity” is fair as long as one is precise about what is being claimed. They contain immune-related gene families. They are not the only immune regions in the genome, and the 2003 reference did not omit the immune system as such. It omitted the parts of the immune-relevant architecture that happened to sit inside the unreadable fraction.
What the 2022 work does and does not settle
The T2T-CHM13 assembly is a single human genome. It was generated from a cell line called CHM13, derived from a complete hydatidiform mole. The line has the unusual property of carrying two essentially identical copies of each chromosome rather than the usual two distinct parental copies, so the assembly effectively had only one version of each chromosome to resolve. That feature made the assembly problem dramatically more tractable. That choice is what made the work possible. It is also what limits it.
A single complete genome does not describe the human species. It describes one source. The Y chromosome, which CHM13 does not carry, was completed separately and published later. The natural next step, and one the same researchers have been pursuing through the Human Pangenome Reference Consortium, is to build a reference that captures the variation across many individuals rather than the sequence of one. The consortium has been working toward a pangenome based on the complete sequences of hundreds of people drawn from different ancestral backgrounds. That work is in progress, not finished.
The other limitation worth naming is interpretive. Having the sequence is not the same as knowing what the sequence does. The roughly 2,000 newly predicted genes will take time to characterise. The centromeric satellite arrays now have a sequence, but the rules that govern how they actually function during cell division remain partly open. The complete reference makes the questions answerable. It does not answer them.
What is worth carrying away
The 2003 announcement and the 2022 announcement are sometimes presented as if one corrected the other. That reading is a little too tidy. The 2003 work delivered what the technology of the day could deliver, and the people running it were largely honest about what it covered. The 2022 work delivered what nineteen years of additional method development made possible. The interval between the two is the part of the story that does not get a press conference: the slow accumulation of better sequencers, better algorithms, and the willingness of a smaller group of researchers to keep working on the parts of the problem that had been declared finished by everyone else.
The complete human genome, in the strict sense, was published on 31 March 2022. The pangenome that will properly represent the variation in the human population is the next milestone. The work continues.