Using the Chromosome Painter to analyse Ancestry's ethnicity estimates.


The case of the monochrome chromosomes...

Not sure if this has been done before, but Ancestry's recent "technological advances" in ethnicity estimates have helped me analyse the limitations of those estimates. These are really just my interpretations, based on what Ancestry say they do, and the results they present. I will undoubtedly have some details wrong, but I cannot arrive at any different conclusion.
I started writing this before finding the 2021 white paper, so I have rewritten it to reflect what I could see there. The 2021 changes were the first time that results were split into maternal and paternal sides. I then had to dig into their published paper to help me understand some of the content in the white paper.

Conclusions:
To save you having to read all this...
Based on my sample size of 2...

  1. The chromosome painter is not to be trusted. The results are often impossible and illustrate the unreliability of the process.
  2. Look at the cross-validation results in the white paper. Some ethnicities appear to give very reliable predictions and so might be quite trustworthy, others not so much.
  3. Many European and in particular British groupings are not highly reliable and you should expect predictions to erroneously assign neighbouring regions. But I cannot see any way to tell whether your case is likely an over- or an underestimate.
  4. Make sure you click on the link for each ethnic group on the map page to read the probable range of the estimate. Note that many of the smaller percentages will include zero in the range, and zero might be the actual answer.

Background
Ancestry have a database (their "reference panel") of DNA from 71,000 people covering 88 geographic/ethnic regions. For example (relevant to my case) they use the following numbers of people:
2300  England and NW Europe (I shall just refer to this as England),
3400  Germanic,
1700  Scotland,
1300  Sweden and Denmark,
880   Wales.
This is all from living (or recently living) people and is expected to provide DNA that accurately samples geographic populations 2 or 3 generations ago. It is then assumed (based on family histories) to remain localised many generations further back. They make efforts to weed out any outliers and inconsistencies.

Ancestry divide your DNA into 1001 segments or 'windows' and look at how each segment compares (indirectly) to the reference sets. How big are these segments? For optimum statistics, I guess they would be split so that each window has about the same number of SNPs (about 300 SNP loci, common across the different chips they have used, according to the white paper). On average this might be around 3.5cM. The white paper states 3 to 10cM. The published paper states 75 SNPs and 3 to 4cM windows, and I assume the white paper details modifications that Ancestry have applied in practice. The actual window selection process is not detailed but is more complicated than simply counting SNPs. Do they omit the same segments as they do when matching cousins? Perhaps not, as persistence is not a problem when looking for segments retained over generations. The painter shows which regions are ignored. The resulting percentage is not simply the fraction of windows assigned to each ethnicity, but is rescaled, possibly based on window size in cM.
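As a rough check on the "around 3.5cM" figure, the arithmetic can be sketched as below. The total map length of 3500cM is my own round-number assumption (it is not a figure from Ancestry), and the function names are purely illustrative.

```python
# Rough window-size arithmetic for the figures quoted above.
WINDOWS = 1001          # number of windows quoted in the post
SNPS_PER_WINDOW = 300   # approximate SNPs per window, as quoted from the white paper
GENOME_CM = 3500.0      # ASSUMPTION: round figure for total genetic map length in cM


def average_window_cm(genome_cm: float = GENOME_CM, windows: int = WINDOWS) -> float:
    """Average window length in cM if the map were split evenly."""
    return genome_cm / windows


def total_snps(windows: int = WINDOWS, snps_per_window: int = SNPS_PER_WINDOW) -> int:
    """Total SNPs used if every window held the same number."""
    return windows * snps_per_window


if __name__ == "__main__":
    print(f"average window: {average_window_cm():.2f} cM")
    print(f"total SNPs across all windows: {total_snps()}")
```

Real windows will vary in length (the white paper's 3 to 10cM), so this only shows that the average is consistent with the window count.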

I won't attempt to describe the Ancestry evaluation process, because I would probably get it wrong. I can tell that it is nothing like simply comparing our DNA SNPs to the reference sets.

Some statistical estimates

This trivial set of calculations is definitely not how Ancestry does the analysis, but something to give me a mental picture of how my DNA samples compare to the reference sets.

This is really "back of envelope" as so much of the required information is not readily to hand. I don't even know how many matches I have on Ancestry. I have 13,000 on MyHeritage against a total dataset that was 6.5 million a year ago. Let's guess they currently have 4 million testers with dominant ethnicity in the above groups that comprise my ancestry. That means I might match one in every 300 testers (this does not account for all the false matches MyH reports). So, I should match 30 out of the 9000 reference people. Generally I would expect to match only on a single segment (maybe 2 or 3 windows long), with a few matching in more than one place. Even if we double it (I think unrealistically generous), and double it again for matches that might get discarded as below the normal cutoff, that is 120 segments matched out of 1001, or nearly 90% unmatched. At 300 SNPs there would also be a rather high rate of false (IBS) matches.
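For anyone who wants to plug in their own numbers, the estimate above can be sketched as follows. All inputs are the guesses from this post, the function and parameter names are just illustrative, and like the text it counts each matched segment as one window. Without the rounding to 30 and 120, the figures come out slightly lower.

```python
# Back-of-envelope estimate of how much of the genome matches the reference panel.
MY_MATCHES = 13_000             # my MyHeritage match count
TESTERS_IN_GROUPS = 4_000_000   # guess: testers whose dominant ethnicity is in my groups
REFERENCE_PEOPLE = 9_000        # rough total for the five reference groups listed earlier
WINDOWS = 1001                  # windows the genome is divided into


def estimate_matched_fraction(matches, testers, reference_people, windows, generosity=4):
    """Return (match rate, expected reference matches, fraction of windows matched).

    generosity=4 is the "double it, and double it again" allowance from the text.
    Each match is counted as a single matched window, as in the text.
    """
    match_rate = matches / testers
    expected_reference_matches = reference_people * match_rate
    matched_windows = expected_reference_matches * generosity
    return match_rate, expected_reference_matches, matched_windows / windows


if __name__ == "__main__":
    rate, expected, fraction = estimate_matched_fraction(
        MY_MATCHES, TESTERS_IN_GROUPS, REFERENCE_PEOPLE, WINDOWS
    )
    print(f"match about 1 in {1 / rate:.0f} testers")
    print(f"expected matches in reference panel: {expected:.1f}")
    print(f"windows matched: {fraction:.1%}, unmatched: {1 - fraction:.1%}")
```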

The probability of good matching on every one of the 1001 pieces would be infinitesimally small.


From another viewpoint:

We can tackle this question from a different perspective - by applying a bit of logic to the chromosome painter results reported for me and my mother.

My maternal ancestry, back a minimum of 6 generations, is all central England, nudging into Wales, with nobody known north of Stafford. Yet, my mother is calculated to be 8% Scottish and 8% Germanic Europe (as well as an unsurprising 18% Wales and balance England). It is not inconceivable that there was a sudden influx of Scots and Germans back a few more generations, especially as the records then leave plenty of room for uncertainty. However, checking the chromosome painter shows how unlikely this interpretation is.

According to Ancestry, my mother's Scottish genetic heritage accounts for 100% of chromosomes 8 and 14 and appears on no other chromosome. Note: this is both maternal and paternal sides. So, through many generations of recombination, the Scottish genes have conspired to claim all of these chromosomes and have been eliminated from all others.

Looking at the Germanic component, it has all of chromosomes 4, 6 and 18, but only on the side designated as "parent 2".

Chromosome 4 is reported to be all England on the paternal side and all Germanic on the maternal side. However, it contains a known 47cM run of homozygosity, where both strands are identical. So clearly none of this was given a significant score - it must have been a poor fit to any of the reference sets - because that region of chr 4 must in reality show the same ethnicity on both sides.


To pile another straw onto the poor camel, despite my mother's chromosome 8 being entirely Scottish on both sides, my chromosome 8 has zero Scottish, but is entirely English on the maternal side and entirely Swedish/Danish on the paternal.

The Chromosome 14 inheritance at least is consistent, with my maternal side being Scottish, but chromosome 6 is also impossible, being 100% Welsh for my mother's parent 1 and 100% Germanic for parent 2, while I am 100% England on both sides.

From my tree, I could estimate my mother is 90% or more English, with up to 10% from Wales. It is made difficult because the Ancestry map draws most of Shropshire, Herefordshire and Monmouthshire as partially overlapping England and Wales, so that may make a chunk of my maternal ethnicity more difficult to determine.
My best guess is that, from direct comparison with the reference panel, my mother is identifiably 2.5% English, 0.7% Welsh, and 0.3% each Scottish and Germanic. The remaining 96% is nothing like any of the reference data and makes no difference either to my trivial calculations or to the fancy statistical models that Ancestry uses. I cannot confirm this at all as I have no access to the data that Ancestry use.

An alternative view comes from Ancestry's own table 4.1 in their cross-validation. Their own tests show the average recall for England to be 0.7. Recall measures the degree of underestimate, so on average 30% of true England ancestry is missed and wrongly attributed to its genetically closest neighbours (who are also geographically closest).
Ancestry state my mother is:
66% England, but in the range 48 to 94%, so within the upper limit,
18% Wales, within the range 3 to 24%,
8% Germanic Europe, within 0 to 31%,
8% Scotland, within 0 to 20%.

So, my estimates all fit within the range offered by Ancestry, but my best guess is that reality is close to the range limits in each case.
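The comparison can be written out as a small check. The reported figures and ranges are the ones quoted above; the tree-based figures are my rough guesses turned into single numbers (90% England, 10% Wales, nothing else), which is an assumption made for illustration. The last step applies the average England recall of 0.7 to one individual, which is also only an assumption.

```python
# Compare tree-based guesses with Ancestry's reported estimates and ranges (percent).
REPORTED = {
    # region: (estimate, range low, range high), as quoted in the post
    "England": (66, 48, 94),
    "Wales": (18, 3, 24),
    "Germanic Europe": (8, 0, 31),
    "Scotland": (8, 0, 20),
}

# ASSUMPTION: single-number versions of "90% or more English, up to 10% Wales".
TREE_GUESS = {"England": 90, "Wales": 10, "Germanic Europe": 0, "Scotland": 0}

ENGLAND_RECALL = 0.7  # average recall for England quoted from table 4.1


def within_range(region: str, value: float) -> bool:
    """True if value lies inside Ancestry's stated range for the region."""
    _, low, high = REPORTED[region]
    return low <= value <= high


def expected_reported(true_percent: float, recall: float) -> float:
    """Percent that would be reported if only `recall` of the true share is recovered."""
    return true_percent * recall


if __name__ == "__main__":
    for region, guess in TREE_GUESS.items():
        estimate, low, high = REPORTED[region]
        verdict = "inside" if within_range(region, guess) else "outside"
        print(f"{region}: tree {guess}%, reported {estimate}% ({low}-{high}%), {verdict} range")
    predicted = expected_reported(TREE_GUESS["England"], ENGLAND_RECALL)
    print(f"England at recall {ENGLAND_RECALL}: about {predicted:.0f}% would be reported")
```

If the average recall did apply to my mother, 90% true England would be reported as roughly 63%, which is not far from the 66% Ancestry gives.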

What is happening with their fancy Hidden Markov Model analysis to make it come up with the outlandish claims of 100% ethnicity tied to different chromosome strands? My knowledge of Markov models stops with knowing how to spell it. If I had to guess, I would approach it as follows: take chromosome 4 as an example. English on one strand and Germanic on another. There is likely one segment (one or more windows) that gives a strong Germanic score and maybe some good matches to English elsewhere on the chromosome. The model has a weighting to reduce recombination across adjoining windows, which favours retaining the ethnicity of a good match into the next window if that next window has no strong match to anything at all. So, effectively, it will retain the same ethnicity until another strong match comes along - it does not actually operate that way, but I think the end result is the same as if it did.
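To illustrate that guess, here is a toy two-state Viterbi sketch. It is emphatically not Ancestry's model: the states, the "stay" probability of 0.99 and the per-window likelihoods are all invented for illustration. It only shows that when most windows carry no information, a penalty on switching lets one strong window colour the whole chromosome.

```python
import math

STATES = ["England", "Germanic"]


def viterbi(emissions, stay_prob=0.99):
    """Most likely state per window.

    emissions: one dict per window mapping state -> likelihood of that window's
    data under that state. stay_prob is the (invented) probability of keeping
    the same state from one window to the next.
    """
    switch_prob = (1.0 - stay_prob) / (len(STATES) - 1)

    def step(prev, cur):
        return math.log(stay_prob if prev == cur else switch_prob)

    # Equal prior on the first window.
    score = {s: math.log(1.0 / len(STATES)) + math.log(emissions[0][s]) for s in STATES}
    back = []
    for window in emissions[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + step(p, s))
            pointers[s] = best_prev
            new_score[s] = score[best_prev] + step(best_prev, s) + math.log(window[s])
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def make_windows(n, strong):
    """n windows, uninformative except those listed in `strong` (index -> state)."""
    windows = [{s: 0.5 for s in STATES} for _ in range(n)]
    for index, favoured in strong.items():
        windows[index] = {s: (0.999 if s == favoured else 0.001) for s in STATES}
    return windows


if __name__ == "__main__":
    # One strong Germanic window among 20 otherwise uninformative ones.
    print(viterbi(make_windows(20, {10: "Germanic"})))
    # Add a strong English window elsewhere on the same strand.
    print(viterbi(make_windows(20, {3: "England", 10: "Germanic"})))
```

In the first case every window is painted Germanic. In the second, the path changes ethnicity once, somewhere between the two strong windows, because nothing in the uninformative windows pins down where.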

There are what seem to me contradictory statements about this tendency to propagate a good result throughout the entire chromosome:

  1. In section 3.4 of the white paper, discussing transition probabilities from one window to the next, it says a "key feature" is that it considers only the states of the current window and the adjacent one, no matter how far away that is from my hypothesised window with a strong score. Thus a 200cM segment becomes just as likely as a 6cM one, even if the Germanic ancestor was 6 generations ago.
  2. In section 3.6 it does refer to the Viterbi path being weighted by recombination distance, but I don't understand if/how that relates to point 1.


Another peculiarity of the process is that it assumes an ethnicity is more likely to be associated with one parent than equally with both. This is reasonable in many cases, but probably not my mother's, where both lines derive from very similar localities. Ancestry acknowledges that the phasing may occasionally flip parent 1 and parent 2, so, unless overridden by a stronger factor, it will always flip sides to keep ethnicities on one side or the other. In my mother's case, all the Wales component is assigned to parent 1, while all the Germanic part is parent 2. The Scottish part is split equally. The Wales assignment, in particular, is difficult to believe.

So, does the chromosome painter behave the same way for most people or is it peculiar to my heritage? My mother has only a single strand of one chromosome reporting more than one ethnicity. I have two strands.

Contrast this with Ancestry's colourful example with 17 different ethnicities scattered across the genome. Their FAQ answer to "why are my chromosomes just one ethnicity" just does not cut it.

Could it be something to do with European groups being more difficult to separate?

in The Tree House by Cameron Davidson G2G6 (7.6k points)
edited by Cameron Davidson

2 Answers

+5 votes
Thanks Cameron, great post. I have similar British Isles ancestry to yourself and came up with the same figure of matching 1 in 300 testers.

I also agree the Ancestry chromosome painter is poor. What I do is use Jonny Perl's one with actual matches. I started off by taking the autoclusters file from MyHeritage, and now and again I add the odd person from GEDmatch or MyHeritage.

I really like Ancestry's new communities. I have 2, one broadly representing my mother and one broadly representing my father.

But in general I think the ethnicity estimates are 'just a bit of fun' due to the limitations you detail in your analysis.

I really wish Ancestry would try to promote the benefits to all DNA testers of adding their "8 greats" rather than the "fun gimmicks". Their ThruLines matching method is infinitely superior to MyHeritage's Theory of Family Relativity, and with the trees they already have on there it can quickly match you to other trees, if people have *some* form of linked tree.

Thanks for the post though! Thoughtful analysis! Great stuff!
by Anonymous Farnham G2G6 (6.7k points)
Please be careful with ThruLines as there are some impossible lines suggested through this tool. However, upon checking with primary sources, I have had several valuable suggestions from this resource.
Oh yes, very much agree. It's good but errors are still made and you have to go back to sources. Further, there are many connections it misses. However, I would say it's just a case of being cautious and using it as a tool to point you to potential leads, which should then be followed up with other supporting evidence. I very much have the same 'impossible' connections. Thanks for the comment!
+4 votes

@Cameron - thanks, that's an excellent post. What caught my eye was "It is then assumed (based on family histories) to remain localised many generations further back."

I can see why they would like to assume that families are static and localised. It might be true in England, but not true in Scotland. As we know from many clan histories and the Scottish Diaspora, finding a Scottish family that stayed in one place can be a rarity instead of a rule. Sadly, I've no idea what that means for the way Ancestry works.

by Keith Macdonald G2G6 Mach 1 (12.2k points)
Thanks Keith, I may have worded that badly. I think they tried to get their reference samples from people whose family histories claimed to be all within the same region. They then performed principal component analysis to try to exclude samples that did not fit the overall pattern.

The People of the British Isles study showed that even England by itself did not have a uniform gene pool, although the central-southern cluster was quite widespread.

Even the Scottish part showed there were some families that remained true to their locality. One line of mine were in the same valley for several generations, occasionally venturing into the adjoining parish.

I had hoped for a more detailed breakdown by testing with LivingDNA as I think they had a connection to the POBI study. It was certainly more detailed, and said 100% Britain, but the fine-grained allocations bore only moderate resemblance to my presumed ancestry. I suspect the POBI sample size was enough to perform that study, but not large enough to cover the full range of customer DNA.
