Did You Know that the First Complete Sequencing of the Y Chromosome was Published 23 August 2023?

Published yesterday in Nature by the members of the Telomere-to-Telomere Consortium is the first full sequencing of the human Y chromosome: "The Complete Sequence of a Human Y Chromosome," https://doi.org/10.1038/s41586-023-06457-y.

This adds about 5.2 million base pairs, or around 9.1%, to the count used by the current GRCh38 genome reference assembly. Also of interest is that the researchers identified 17.7% more genes in total than appear in GRCh38, and 60.6% more protein-coding genes. This yDNA sequencing was the last piece of the puzzle remaining in order to construct an initial pangenome reference.

It's likely many of you are already aware of the work. It began in 2021, and the paper was received by the Nature editorial board in December 2022, but it didn't finish all the steps in peer review and see publication until yesterday.

At least for a limited time, Springer Nature has made the full article available for open access at this very long link.

To download a PDF copy or print the article still requires access through Nature's paywall, however.

Also published August 23 in Nature was a separate study that sequenced 43 Y chromosomes, including 21 of African descent. For this work, PacBio HiFi sequencing equipment was used, as well as Bionano Genomics optical genome maps: "Assembly of 43 Human Y Chromosomes Reveals Extensive Complexity and Variation," https://doi.org/10.1038/s41586-023-06425-6.

Of particular interest in this one is the unexpected discovery that the chromosome can vary in length by literally millions of bases. Even coding genes like TSPY can have 23 to 39 copies on the chromosome. Further reinforcement of the need for a pangenomic model.

This article, too, has an open access presentation from Springer Nature at this link.

in The Tree House by Edison Williams G2G6 Pilot (446k points)
Thank you for sharing this with us, Edison.

Keeping in mind that I am not at all a genetics expert like you, I have a probably stupid question:  How do they know that it is actually complete, and that there aren't more out there that just haven't been discovered yet?

You're absolutely correct. The second study I referenced that Nature published simultaneously was interesting in large part due to the considerable variation they found in the Y chromosome. I'm not sure anyone predicted that normal and healthy Y chromosomes could differ by counts of millions of base pairs.

We'd thought for quite some time that we had a good handle on the total number of base pairs in the Y. The GRCh38 reference genome--in place for a decade now and used by FTDNA for yDNA work--has it pegged at 57,227,415 base pairs. The older version of the reference model that we still use for our common autosomal tests, GRCh37, turns out to have been closer with a count of 59,373,566 base pairs. The research by the Telomere-to-Telomere Consortium (T2T) found 62,460,029 base pairs.
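For anyone who wants to check the arithmetic behind those figures, here is a minimal Python sketch. The three base-pair counts are the ones quoted above; the function and variable names are just my own choices for illustration.

```python
# Y chromosome lengths in base pairs, as quoted in this thread.
GRCH37_Y = 59_373_566
GRCH38_Y = 57_227_415
T2T_Y = 62_460_029


def added_vs(reference_bp, new_bp=T2T_Y):
    """Return (extra base pairs, percent increase) of new_bp over reference_bp."""
    extra = new_bp - reference_bp
    return extra, 100.0 * extra / reference_bp


extra38, pct38 = added_vs(GRCH38_Y)
extra37, pct37 = added_vs(GRCH37_Y)
print(f"vs GRCh38: +{extra38:,} bp ({pct38:.1f}%)")
print(f"vs GRCh37: +{extra37:,} bp ({pct37:.1f}%)")
```

The GRCh38 comparison reproduces the "about 5.2 million base pairs, or around 9.1%" from the original post, and the smaller gap for GRCh37 is why I say the older reference turned out to be closer.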

But the truth is that we'd never been able to accurately sequence the whole chromosome before. The available technology didn't allow it, and we were sorta guessing by using combinations of other tools.

The FTDNA white paper from March 2019 about the Big Y-700 test gives you a good idea of the large area of the Y chromosome we couldn't sequence previously:

The problem with that region labeled "inaccessible" is that it's densely heterochromatic, or tightly packed and condensed, and highly repetitive. Our current methods of DNA sequencing require that the DNA be broken up into small chunks, and then the sequencing operation does multiple reads of those various bits, typically 30, 60 or 100 times. The data are then aligned sort of like a jigsaw puzzle so that the laboratory can tell which pieces go where.

Think of it as having a really long, Faulknerian sentence that someone has cut up into fragments, each only a few letters long. You have umpteen fragments because each individual letter has been read 30 times, but it appears in different places in each of the fragments. For example, in the phrase "umpteen fragments" you may have one sliver of paper that has "agm"; another that has "n f"; and another that has "een."

Your job is then to reassemble all those tiny pieces of paper to figure out exactly what Faulkner wrote. It's like a nightmare version of Wheel of Fortune, only worse, because there are only four letters and, collectively, you'd have from about 498 million letters (248.96 million base pairs) on the longest chromosome, Chromosome 1, to 125 million letters (62.46 million base pairs) on one of the smallest chromosomes, the Y.

True, in "shotgun" and short-read DNA sequencing you'd get hundreds of letters per individual segment, but with only those four letters you can imagine the conundrum if you had regions where long and multiple series of repetitions were present. What do you do if your read length is 500 letters and you have a chromosomal region that contains 900-letter sequences that repeat scores of times in a row?
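To make that conundrum concrete, here is a toy Python sketch. Everything in it is invented for illustration: the 7-letter repeat unit, the flanking letters, and the read lengths are assumptions, and real reads have errors and uneven coverage that this ignores. It also compares only which distinct reads occur, not how many times each one turns up.

```python
def reads(genome, read_len):
    """Every distinct substring of length read_len (an idealized, error-free read set)."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


UNIT = "GATTACA"                 # hypothetical 7-letter repeat unit
LEFT, RIGHT = "CCCCC", "TTTTT"   # hypothetical unique flanking sequence

# Two made-up "chromosomes" that differ only in how many times the unit repeats.
five_copies = LEFT + UNIT * 5 + RIGHT
eight_copies = LEFT + UNIT * 8 + RIGHT

# Reads shorter than the repeat array: both versions yield the same pieces.
short_same = reads(five_copies, 10) == reads(eight_copies, 10)

# Reads long enough to span the shorter array: the pieces now differ.
long_same = reads(five_copies, 40) == reads(eight_copies, 40)

print(short_same, long_same)
```

This prints `True False`. With 10-letter reads, the slips of paper from the 5-copy and 8-copy versions are indistinguishable, so the puzzle has more than one valid solution. With 40-letter reads, some pieces reach from the repeats into the unique flanks and the two versions can be told apart.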

In a nutshell, that's the problem the newest generation of hybrid long-read and nanopore sequencing solves. 

There are highly repetitive areas like that found in chromosomes 1, 3, 9, 16, and 19, plus the entire short arms of the acrocentric chromosomes, 13, 14, 15, 21, and 22. If you use DNA Painter, for instance, you'll see some of these regions grayed-out in the map. But proportionate to size, no chromosome contains more of this mysterious area than the Y chromosome (I'm excluding the inactive X in women, called the Barr Body; that's a whole 'nother topic).

What the new sequencing of the Y chromosome gives us, for the first time, is a clear picture of what the entire chromosome looks like. For genealogy or investigative (forensic) genealogy, we just don't know if there's going to be much in the previously mysterious region that will be of use. Time will tell...but don't expect to see this type of sequencing technology available to us at the big testing/matching companies for a while, and it's not clear it will even be commercially feasible in the near term.

But to your point of variability, that's exactly why there's been a press to shift to a pangenomic model, and why the NCBI has indefinitely suspended the release of GRCh39. Over the entire genome, this "mystery area" accounts for around 8% to 9% of everything. The detail we're getting from the newest generation of sequencing technologies is showing us that humans have more genetic differences than previously thought.

Thus the move to go to a pangenome reference and stop using a single model for the global population...a model where the majority of the genetic data was obtained from one man who lived in Buffalo, New York; a man who happened to respond to a newspaper ad about DNA testing for research.

Wow, this is all so very fascinating to me. I haven't yet had the opportunity to read the articles that you've linked to, because I've been working on the WikiTree Games challenge. I look forward to reading them all.  Maybe I will even understand some of it. :)   

I can grasp that Y represents the direct paternal line and that fathers give Y only to their sons. What I haven't been able to understand is why there are Ys (or, what appeared to be) in the raw data for me, my mother, and at least one of my daughters.

You've also given me something else to read about - Barr Body.  I look forward to it.

Thank you for taking the time to answer my questions and giving me the opportunity to learn more about genetics.

Suzanne, your finding Y chromosome data in your raw data is something many women--depending on the test they've taken and how deeply they've dived into the results--experience.

Glance back up at that graphic from FTDNA. See the areas at the very ends of the Y chromosome labeled PAR1 and PAR2? These are the pseudoautosomal regions.

I thought for sure I had something bookmarked from one of the popular genetic genealogy bloggers about this, but I don't. I'm certain one of them has addressed it...and would do a much better job explaining it than me. So some Google Fu may be in order. But in the meantime...

The XY combination in males is the only chromosome pairing of distinctly different haploid, or single, chromosomes. All the other chromosomes pair up with their homologous sister, e.g., your dad gives you one Chromosome 7, and your mom gives you one Chromosome 7.

There has to be some mechanism that allows the very different X from a man's mother to pair with the Y from his father. That mechanism lies in the pseudoautosomal regions. The small regions aren't of any real use genealogically, but they do have medical significance...so all the common microarray chips used by the DNA testing companies include SNPs in the PAR. One of the reasons they're called "pseudoautosomal" is that they're the only part of the Y chromosome subject to recombination, like an autosome.

And here's the trick. In the male, both the haploid X and the haploid Y have those pseudoautosomal regions at the ends of each of the two chromosomes. And...in females each of their two X chromosomes also contains pseudoautosomal regions. Which makes complete sense because when genetic material is created for an ovum, it doesn't know if it's going to be fertilized with another X chromosome or with a Y. The joining mechanism's gotta be there regardless.

One result is that DNA tests that look specifically for SNPs in the PAR are going to ascribe them to either a separate faux-chromosome designation or they're going to include them as being on the Y chromosome.

For example, in their raw results AncestryDNA numbers the autosomes 1 through 22; the X is 23; the nonrecombining portion of the Y chromosome is 24; SNPs in the PAR are shown as chromosome 25; and SNPs in the mitochondrial DNA are 26.

But 23andMe doesn't do that. They have 1 to 22 for the autosomes, then their labels are "X," "Y," and "MT." So the PAR data can look like a female test-taker has yDNA data.
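If you want to see how your own AncestryDNA raw data breaks down under that numbering, here is a rough Python sketch. It assumes a tab-separated file with `#` comment lines, a header row, and the chromosome code in the second column; the sample rows and rsIDs are made up, so check the layout of your own download before relying on it.

```python
import csv
import io
from collections import Counter

# Chromosome codes as described above for AncestryDNA raw data.
ANCESTRY_CODES = {"23": "X", "24": "Y (non-recombining)", "25": "PAR", "26": "MT"}


def label(code):
    """Translate an AncestryDNA chromosome code into a readable label."""
    return ANCESTRY_CODES.get(code, code)  # autosomes 1-22 pass through unchanged


def count_by_region(raw_text):
    """Count SNP rows per region, skipping '#' comment lines and the header row."""
    lines = (line for line in io.StringIO(raw_text) if not line.startswith("#"))
    rows = csv.reader(lines, delimiter="\t")
    next(rows, None)  # discard the header row
    return Counter(label(row[1]) for row in rows if row)


# Made-up rows for illustration only; not real rsIDs or positions.
SAMPLE = (
    "# hypothetical sample data\n"
    "rsid\tchromosome\tposition\tallele1\tallele2\n"
    "rs0000001\t7\t1000\tA\tG\n"
    "rs0000002\t23\t2000\tC\tC\n"
    "rs0000003\t25\t3000\tT\tT\n"
    "rs0000004\t26\t4000\tG\tG\n"
)

counts = count_by_region(SAMPLE)
print(counts)
```

Going by the explanation above, a woman's file would be expected to show rows under "PAR" but none under "Y (non-recombining)". A 23andMe file has no separate PAR label, which is why the same kind of data can look like yDNA there.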

So nothing odd going on at all. It's just a unique chromosomal joining mechanism that the DNA testing companies don't have a standardized way of reporting.

Thank you for explaining this to us. 

A while back someone else had told me that it could be because of having had male pregnancies. That works for me and my mother as we've both had sons, but our oldest daughter has never been pregnant. So, that explanation didn't make sense to us at all, for any of us. 

In regards to what you said here: "...and SNPs in the mitochondrial DNA are 26,"  does that mean there is a way for me to figure out our mtDNA with the data in 26?  Maybe without going insane in the process?

Hopefully, they will some day have a standardized way of reporting so that it is less confusing (alarming?!) to those of us not well versed in genetics. 

And I learned a new word today…amplicon.

Suzanne: There is a way to get an estimate of the high-level mtDNA haplogroup (or even a subclade a level or two deeper) from AncestryDNA results, but there are some caveats.

Biggest of these is that the earlier versions of Ancestry's test, prior to its introduction of v2 (this means samples submitted prior to May 2016), did not contain mtDNA data. These were tests processed by FTDNA's lab in Houston.

With version 2, AncestryDNA switched to Quest Diagnostics and some mitochondrial data was included in the raw data, but of course Ancestry never reported on it.

To complicate things a bit, there have been various iterations in the chip configurations used for AncestryDNA v2 tests, but Ancestry also never told us about those changes or when they occurred. The v2 tests I've analyzed contained between 164 and 263 mitochondrial SNPs (see my Dec 2019 article on the subject). By comparison, the 23andMe v5 test looks at 4,301 mtDNA SNPs.

So checking the AncestryDNA mtDNA data is sorta like Forrest Gump's "box o' chocolates": there may be something interesting in there, but you never know what you'll get. If meaningful SNPs for your particular haplogroup were tested, you can get useful info.

But unlike yDNA where the SNPs are essentially hierarchical--a man typically won't have SNPs B and C unless he also has SNP A--the mitochondrial genome is tiny and the SNPs are, er, repurposed: a SNP may appear as a component in a dozen haplogroups/subclades. It's the overall combination of SNPs that ends up defining the subclade.
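Here is a toy Python sketch of what "the overall combination" means in practice. The haplogroup names and marker names are entirely invented, and this is not how any real haplogroup-calling tool works; it only shows why a single shared SNP settles nothing, and why untested SNPs leave you with a guesstimate.

```python
def score(defining, observed, tested):
    """Return (matched, checked): how many of a haplogroup's defining markers
    were tested at all, and how many of those tested markers were observed."""
    checked = defining & tested
    matched = checked & observed
    return len(matched), len(checked)


# Hypothetical haplogroups and marker names, invented for illustration.
HAPLOGROUPS = {
    "HgA": {"m1", "m2", "m3"},
    "HgB": {"m1", "m4"},
    "HgC": {"m2", "m4", "m5"},
}

tested = {"m1", "m2", "m4"}   # markers the chip happened to include
observed = {"m1", "m2"}       # tested markers where the variant was seen

results = {
    name: score(markers, observed, tested)
    for name, markers in HAPLOGROUPS.items()
}

for name, (matched, checked) in results.items():
    print(f"{name}: {matched} of {checked} tested markers match")
```

In this made-up case, marker m1 belongs to both HgA and HgB, and m2 to both HgA and HgC, so neither one decides anything alone. Only HgA matches on every marker that was tested, but m3 was never on the chip, so even that call stays provisional.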

So... About the best we can hope for from examining AncestryDNA mtDNA data is a reasonable guesstimate. There's a good chance the outcome will be definitive at the highest level, basal haplogroup, but probabilities vary considerably below that. I will say, however, that I've had instances of unknown parentage searches where checking the AncestryDNA mtDNA raw data in multiple kits helped determine which side of the family some of the matches were on, and allowed the search to focus on the remainder.

There are a couple of ways to approach doing this, but neither is completely straightforward. I don't think they'd be crazy-making, but they do take preparation and work. It may not be worth it unless there's a use-case that could benefit (in the unknown parentage instances I mentioned, we had people who had taken AncestryDNA tests and had been willing to supply their raw data files, but who were unwilling to consider taking any additional tests). If that's something you decide you need to do, drop me a private message from my profile.

Geoff: I know, right? "Amplicon" sounds like it should be the name of a Marvel comic book hero...or villain. Or maybe an annual comic-book convention.

But thank goodness we discovered PCR (polymerase chain reaction). We'd never have been able to do what we do with DNA testing and sequencing without it.

Hi Edison,

Thank you for yet another informative response to my questions.  

Please accept my apologies for such a delay in responding. On Sep 3, David (my husband) had a massive stroke and we were at various hospitals and rehab centers for three months. I don't have nearly the computer time, or family history and DNA research time, that I had before the stroke. This new normal has been very difficult in a multitude of ways.

I look forward to reading your article on mtDNA, hopefully very soon. Thank you again.

Have a blessed weekend.

1 Answer

Does this mean FTDNA is now going to upgrade from Y-700 to something more?

by Ida Houston G2G6 (7.2k points)

Certainly not any time soon...if ever in my lifetime.

If I had to speculate, some of the information to be found in that densely repetitive stretch of the Y chromosome may be useful for forensic investigations (and perhaps more likely in large-scale population genetics), but I doubt much of it will shed any appreciable new light on genealogy.

If eventually that proves otherwise (key word being "eventually" because that will require a body of new scientific studies and likely take years), it would also be a big expenditure decision on the part of FTDNA. In the meantime, Illumina could come out with a new generation of DNA sequencers, but right now FTDNA (and its parent company, Gene by Gene) is predominantly an Illumina laboratory. These new full-sequence results from the Telomere-to-Telomere Consortium have depended on a hybrid of sequencing methods, including equipment from companies like Pacific Biosciences, Oxford Nanopore, and Bionano Genomics.

Personal opinion--and we know that, historically, I'm always correct with the only exception being on days that end in the letter "y"--is that we'll first see the hybridized technology available to consumers in the form of whole genome sequencing tests. I can't predict who that might come from, but it probably won't be from one of our "Big 5" genealogy testing companies. At least not early on. I mean, none of them even offer whole genome sequencing yet, and that marketplace price-point has been down around the $200 to $300 range for 30X coverage for over three years now.

If I had a close friend on the fence about ordering a Big Y-700 test today--and the issue wasn't money--I'd make a convincing case that the results are the best available for the Y chromosome; that the value from the test will never decrease even if something new comes along; and that, if it does, that "something new" is still years away.


...