How much can I rely on DNA segments less than 8 cM?

+7 votes
1.0k views

Sometimes I have a genealogical relationship with another user; both of us have taken a DNA test (with Ancestry), but we are not listed as matches on Ancestry.

However, if we have both uploaded our DNA to GEDmatch, we can compare segments and often find that we do have matching segments; but because none are larger than 8 cM, we are not listed as matches on Ancestry.

For example, here is a match between myself and https://www.wikitree.com/wiki/Adams-19658 using his A672670 kit and my ZD1872209 kit. This was done using the Autosomal one-to-one comparison tool, and I lowered the default minimum segment size from 7 cM to 3 cM.

Chr B37 Start Pos'n B37 End Pos'n Centimorgans (cM) SNPs Segment threshold Bunch limit SNP Density Ratio
1 10,901,907 12,128,498 3.2 248 215 129 0.32
1 244,268,142 245,307,406 4 222 195 117 0.35
6 165,832,227 166,876,708 3 270 198 118 0.32
9 2,203,938 3,194,613 3.2 307 174 104 0.36
10 15,629,360 17,164,221 3.3 342 187 112 0.33
12 78,524,505 82,071,784 3.7 439 262 157 0.29
15 86,731,697 89,034,993 4.5 479 185 111 0.33
19 1,051,214 2,268,055 4.6 228 210 126 0.32
20 60,538,561 61,759,067 3.6 247 203 121 0.29
21 41,998,571 42,876,447 3.4 256 191 114 0.36
22 23,861,112 25,604,477 3.6 247 203 121 0.27


Largest segment = 4.6 cM

Total Half-Match segments (HIR) 40.1cM (1.117 Pct)

11 shared segments found for this comparison.

422606 SNPs used for this comparison.

52.942 Pct SNPs are full identical

Comparison took 0.015.

CPU time used: 0.014.

Now I know a single segment less than 8 cM is generally unreliable, but I am assuming that having 11 such segments is too many to be coincidental?

So my question is: how many segments of less than 8 cM would be considered a good indicator of a match, especially when we already have a genealogical relationship based on family tree research?

Secondly, my DNA test was done on Ancestry a couple of years ago; Simon's is an older one. He also has another test labelled Migration V4 - M.

Comparing Kit M322104 (*SimonA) [Migration - V4 - M] and Kit ZD1872209 (Paul Taylor) [Ancestry]

Segment threshold size will be adjusted dynamically with an average of 200 SNPs. About 2/3 will occur between 185 and 214 SNPs.
Minimum segment cM to be included in total = 3.0 cM
Mismatch-bunching Limit will be adjusted to 60 percent of the segment threshold size for any given segment.
 

Chr B37 Start Pos'n B37 End Pos'n Centimorgans (cM) SNPs Segment threshold Bunch limit SNP Density Ratio
2 134,992,922 139,489,417 3.5 377 202 121 0.19
3 25,033,188 27,432,995 3.2 234 214 128 0.19
3 180,765,861 183,786,678 4.7 247 211 126 0.19
12 20,794,191 22,978,045 3.3 265 214 128 0.19
12 78,180,871 82,027,471 4.1 348 255 153 0.21
15 85,678,251 89,034,993 5.3 445 184 110 0.22
18 57,130,979 58,628,364 3.5 190 187 112 0.21
21 41,674,077 42,871,273 4.5 226 196 117 0.24


Largest segment = 5.3 cM

Total Half-Match segments (HIR) 32cM (0.892 Pct)

8 shared segments found for this comparison.

290725 SNPs used for this comparison.

52.898 Pct SNPs are full identical

So I match some similar segments on this test, but not as many, and I match some new segments. I'm assuming the difference is due to different tests checking different parts of your DNA, but I don't quite know how to decode this.



Comparison took 0.038.
CPU time used: 0.015.

Update

Okay, I now have Simon's father's DNA; results as follows:

Comparing Kit ZD1872209 (Paul Taylor) [Ancestry] and Kit XP6290629 (Peter Adams) [FTDNA]

Segment threshold size will be adjusted dynamically with an average of 200 SNPs. About 2/3 will occur between 185 and 214 SNPs.
Minimum segment cM to be included in total = 3.0 cM
Mismatch-bunching Limit will be adjusted to 60 percent of the segment threshold size for any given segment.

 

Chr B37 Start Pos'n B37 End Pos'n Centimorgans (cM) SNPs Segment threshold Bunch limit SNP Density Ratio
1 244,228,699 245,130,641 3.4 203 199 119 0.35
2 215,840,414 217,199,624 3 260 182 109 0.34
2 239,833,716 241,440,020 5.1 385 203 121 0.29
4 32,626,153 37,404,319 3.6 486 200 120 0.26
5 16,491,052 18,777,638 3.6 387 193 115 0.33
5 22,351,139 25,678,072 3.5 398 207 124 0.29
7 11,730,333 13,748,392 3.3 455 182 109 0.32
12 78,524,505 81,016,956 3.2 308 262 157 0.3
14 97,190,794 98,338,357 3.3 252 215 129 0.37
15 26,336,463 27,423,772 4.1 280 182 109 0.35
21 41,581,684 42,876,447 4.9 341 197 118 0.34
22 48,562,589 49,201,704 4.5 271 186 111 0.36


Largest segment = 5.1 cM

Total Half-Match segments (HIR) 45.3cM (1.264 Pct)

12 shared segments found for this comparison.

428613 SNPs used for this comparison.

52.216 Pct SNPs are full identical

Comparison took 0.043.
CPU time used: 0.017.

Ver: Mar 22 2023 01 16 51

So we have a match with the other tests on chromosomes 12 and 21, so I assume that is good evidence that these segments are valid. The match on chromosome 15 is at a different location, so I guess we can discount that as a valid match. The other matches could possibly be valid because he is a generation nearer to me, but based on what you say they are most likely invalid.

in WikiTree Tech by Paul Taylor G2G6 Mach 1 (18.8k points)
edited by Paul Taylor

7 Answers

+12 votes
 
Best answer

Well... I already wrote this piecemeal between phone calls and "urgent" emails, so I'm going to post it anyway. We seldom hear from Dr. Millard here, a well-respected voice, so I recommend you heed what he has to say and ignore me. But nothing ventured, and all that...


Part 1

Paul, I would most definitely file all of those GEDmatch results under "ignore forever." There are multiple reasons why small segments are meaningless for genealogy and ancestry, not the least of which is that they're exceedingly likely to be false as reported. I won't do a really deep dive down into the biology and the math...I do that at length here every once in a while, and only a few are, uh, fans of my often exceedingly long posts.

But I'll still bore you with four points. And it's still gonna be long. My perennial excuse--and I'm stickin' with it--is that complex subjects can't be accurately discussed in only a few words. Trying to describe genetics in a single Twitter post is like trying to write The Decline and Fall of the Roman Empire in 20 pages.

The first point is that, no, the presence of multiple small and probably false segments does not increase the likelihood that any of the segments are, in fact, real. A good summary was written by Blaine Bettinger last summer, "An In-Depth Analysis of the Use of Small Segments as Genealogical Evidence." As Blaine writes: "There is no evidence that: (1) triangulating a small segment; or (2) sharing a large segment in addition to the small segment; or (3) finding a shared ancestor, increases the probability that a small segment is valid." To form it as a trite maxim: two wrongs don't make a right.

Second, while we face a lot of challenges with genetic genealogy, one of the most significant is trying to figure out when a segment is actually a segment. The core problem here is that the centiMorgan is in no way a physical measurement. It's merely an estimate of the probability of a crossover (recombination) occurring at certain points in the genome. The calculations used have changed little since the 1940s, and they're based on a single reference genome (the majority of which was derived from one man in Buffalo, NY; updates to it have been put in abeyance by the NIH for now while it listens to the call of scientists to move away from the one-genome reference toward a pangenomic model...and the reference we use for genealogy is already a decade and 14 versions out of date as it is). On top of that, for genealogy all results of the calculation are reported as sex-averaged values, despite the fact that crossing over occurs in females at a frequency approximately 70% higher than in males; in other words, the female genome works out to be about 70% larger than a male's in centiMorgans.

So then a primary tool of genetic genealogy, the centiMorgan, is highly imprecise to begin with, even if we know--as we might via whole genome sequencing--every nucleotide base in any given chunk of DNA...but none of the testing/reporting companies share with us exactly what they are comparing. When the calculated centiMorgans are large enough, it doesn't make much difference, because when we look at large chunks of DNA the imprecision is less important. An error here and there (in the form of a mismatching SNP or an overlong stretch of the chromosome where no SNPs are compared) is ameliorated because we have much more data to work with. Think of it as trying to accurately estimate the number of blades of grass in a large lawn, starting out knowing the lawn's area: in one instance you have blade counts from 10 square meters as your sample, and in the other you only know the count in one patch of 10 square centimeters.
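A quick way to feel the force of that lawn analogy is a toy simulation (all numbers invented for illustration): estimates scaled up from a single small patch swing far more wildly than estimates from a large sample.

```python
import random

random.seed(1)

# Hypothetical lawn: 100,000 patches of 10 cm x 10 cm, each with its own blade count.
patches = [random.randint(5, 35) for _ in range(100_000)]
true_total = sum(patches)

def estimate_total(n_sampled):
    """Estimate the lawn-wide total by scaling up the mean of a random sample."""
    sample = random.sample(patches, n_sampled)
    return sum(sample) / n_sampled * len(patches)

def mean_relative_error(n_sampled, trials=200):
    """Average relative error of the estimate over repeated sampling trials."""
    errs = [abs(estimate_total(n_sampled) - true_total) / true_total
            for _ in range(trials)]
    return sum(errs) / trials

large_sample_err = mean_relative_error(1000)  # ~10 m^2 worth of patches
single_patch_err = mean_relative_error(1)     # one 10 cm x 10 cm patch
# The single-patch estimate is wildly noisier than the large-sample one.
```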

Third, as you noted, Paul, there are significant differences among the common microarray tests we take, and have taken. Excluding organizations like WeGene or Genes for Good, from the first DTC autosomal test through today any two sets of raw data may examine as few as 17% of the same markers. (If interested, I prepared an overview JPEG chart of manufacturers and types of microarray chips used.) And never mind that in current tests using the Illumina GSA chip, up to 19% of the SNPs targeted for testing are included expressly for clinical research purposes, not genealogy. In fact, only some of these--like ones that affect phenotype, e.g., hair and eye coloring--have much bearing on ancestry at all. So from an informational perspective, that 17% overlap effectively might be lower by a few or several percentage points.

That's not a very big overlap to work with. If the 0.02% of our genome that microarray tests look at is the small patch of 10 square centimeters in the lawn, then testing incongruencies may reduce our effective sample size to 8 square centimeters. Understandably, sites like GEDmatch and MyHeritage, which accept uploads from other companies in order to allow cross-company comparisons, face a big challenge in accurately evaluating those discrepancies.

Fourth: GEDmatch has to be used carefully. It's an extremely valuable resource, but its most valuable feature is also the one that is most easily misinterpreted and misused: the flexibility to make comparisons at a granular level with many different options.

Many services use tools like genotype imputation and forms of computational phasing in order to improve the accuracy of their results. GEDmatch does not. They use only simple arithmetic calculations for SNP matching, and they've lowered the default matching thresholds for number of SNPs and SNP density twice. Considering the number of matches AncestryDNA shows me in their market-largest database of about 24 million, compared to GEDmatch with only 1.4 million--a database less than 6% the size--I knew there had to be a discrepancy.

Completely informally, I decided one way to evaluate what was happening at GEDmatch was to extract as large a set of data as I knew they would accept (by using the known "templates" of 11 different versions of microarray tests) from my whole genome sequencing SAM file, and use that to create a "superkit" of about 2.1 million SNPs. The dbSNP database has over 650 million human SNPs/SNVs on file, so 2.1 million isn't a massive number, but it's more than the typical 650K from a microarray test. I also created 11 GEDmatch research kits using the same DNA sampled at the same time for as close to apples-to-apples comparisons as I could get.

The hypothesis was that the greater SNP density afforded by 3.3 times the number of SNPs shown in the superkit versus any given set of microarray data would lead to more accurate results because SNP mismatches and long gaps with no compared SNPs would likely be uncovered, revealing that some matching segments reported when comparing only two sets of microarray data would probably be a conflation of what are in fact two or more smaller segments.

by Edison Williams G2G6 Pilot (446k points)
selected by Ken Parman

Part 2

A brief summary of the data is at this Google Sheet. I've been (casually, not earnestly) thinking about trying to locate a child and both parents who have done a 30X WGS or better, and at least a couple of their cousins (preferably 2nd through 4th) who also have done WGS and who are all willing to share their BAM files with me. It would take some computing horsepower and some work in Python or R, but it would be interesting to deconstruct their reported matches and compare the WGS data to the microarray results.

My informal GEDmatch experiment didn't yield quite what I'd expected. Using the GEDmatch default settings as-is, one-to-many runs at ≥ 30.5cM were essentially identical across the board. Deviations of up to 8.5% began at ≥ 20cM. By the next selectable threshold, though, ≥ 10cM, the disparities were somewhat astonishing. The microarray tests differed markedly at that level, with the current ones using the Illumina GSA chipset being the worst performers...meaning the ones showing the greatest numbers of what stood a good chance of being false positives. At greater than or equal to 10cM, the lowest performer indicated that as few as 1 in every 3.1 reported matches was likely to be valid.

To look at that particular microarray test data further, I started with the first reported one-to-many match at 10cM and did one-to-one matching with the next 50 of them (the second tab on that Google Sheet labeled "10cM-Sampling"). Those 50 reported matches yielded a total of 69 segments. Of those, only 14 also appeared as matches to the WGS superkit: a potential 79.7% segment error rate as opposed to the aggregate summary rate of 68%.
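The error-rate arithmetic in that paragraph is straightforward: it's just the fraction of reported segments that failed to replicate against the superkit.

```python
reported = 69    # segments from the 50 one-to-one comparisons at >= 10 cM
replicated = 14  # segments that also appeared against the WGS superkit
error_rate = (reported - replicated) / reported
print(f"{error_rate:.1%}")  # prints 79.7%
```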

I hadn't expected potential error rates that high that quickly. My guess had been that at ≥ 20cM I'd see matching rates nearly identical with the superkit, and that the discrepancies at ≥ 10cM would be somewhere around 15-20%...not 70%.

The data implied that, using GEDmatch's default settings, at the level of a reported 20cM the segment would be real roughly 92% of the time. Still not good enough to denote precision, but a fair trade-off as a minimum threshold for genealogical purposes. That the accuracy improved dramatically as we approach 30cM would imply that a sweet spot probably lives somewhere in the low 20s. Conversely, the drop-off from 92% to 30% at 20cM to 10cM respectively, would imply that we need to be well above 10cM to infer an actual match.

There are thresholds we can manipulate at GEDmatch to help with accuracy. One is to never leave the "overlap cutoff" setting at its default of 45,000. That allows far too few of the same markers to be in the comparison. If our microarray tests average around 650,000 markers and the lowest overlap is 17%, that equates to over 110,000 markers. GEDmatch, though, works with "slimmed" versions of the uploaded data, but 17% is the worst-case scenario. I'd advise using 90,000 as the minimum overlap cutoff and dropping to 72,000 with cognizant caution.

The SNP count does definitely matter. Prior to GEDmatch Genesis the minimum was 700 SNPs; with Genesis going into production that became a dynamic range between 200 and 400; now it's "about 2/3 of segments will have between 185 and 214 SNPs." I consider the original 700 somewhat reasonable, but the modifications were made to accommodate those tests that overlap on only a minority of the same SNPs. In order to keep reported match numbers high, GEDmatch decreased the SNP density requirements.

With a very broad brush, 1cM is approximately equal--with a lot of flexibility based upon chromosomal location--to about 1 million base pairs. Our microarray tests look at only about one marker in every 4,800 base pairs, on average. At that relative density, there should be just over 200 SNPs per centiMorgan. Anything much lower than that means fewer SNPs have been examined in a comparison between two tests than the approximate physical average across the genome that was tested by the microarray. Caution should be applied. If you see, for example, a 7cM segment reporting 300 matching SNPs when the genomic average should be closer to 1,400, it can be an indication that the comparison is flawed.

Super-lacking in precision, but if you take that 200 SNPs per cM and halve it to 100 per cM, I think you end up with a reasonable threshold that's simple to apply, e.g., a 10cM segment should be comparing, give or take, around 1,000 SNPs. Much lower than that, be skeptical.
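As an illustration of that rule of thumb (the 100 SNPs/cM floor is the heuristic suggested here, not a GEDmatch setting), here is how one might screen the 11 segments from the first comparison in the question:

```python
def plausible_snp_density(cm, snps, min_snps_per_cm=100):
    """Screen a segment by matching-SNP density. Genome-wide, microarray
    comparisons should average roughly 200 SNPs per cM; the halved figure
    of 100 SNPs/cM is the rule-of-thumb floor suggested above."""
    return snps / cm >= min_snps_per_cm

# (cM, SNPs) for the 11 segments in the question's first GEDmatch comparison.
segments = [(3.2, 248), (4.0, 222), (3.0, 270), (3.2, 307), (3.3, 342),
            (3.7, 439), (4.5, 479), (4.6, 228), (3.6, 247), (3.4, 256),
            (3.6, 247)]
flagged = [s for s in segments if not plausible_snp_density(*s)]
print(f"{len(flagged)} of {len(segments)} segments fall below 100 SNPs/cM")
# -> 8 of 11 segments fall below 100 SNPs/cM
```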

At the bottom of the GEDmatch free one-to-one autosomal comparison tool is a little checkbox to "prevent hard breaks." I recommend that be kept unchecked. The distance between matching SNPs of up to a half million base pairs is already arguably excessive. Using that checkbox to allow even larger gaps does nothing to improve accuracy of the comparison.

I won't dig down into a fifth point, one talking about match pile-up areas (these typically originate via something called linkage disequilibrium and mean that many small segments cannot be traced to specific ancestors because they are too old and spread too pervasively throughout similar regional, clan/tribe, and even familial populations). Without this biological foible, the testing companies would have no way to even start trying to provide the "ethnicity estimates" that they do.

A simple way to illustrate the effect this can have is to take two GEDmatch kits from people who share a regional "founder population," like you and I do via Great Britain, and do a one-to-one match at the default settings but with the centiMorgan threshold dropped down to the floor-minimum of 3cM. For instance, you can run your kit against my 23andMe v5 test, ZL4037910. Even though most of my great-grandparental lines were in America before the late 1700s (and at a cursory glance you and I have no surnames in common), you'll find that, with its free autosomal comparison tool, GEDmatch shows us as sharing 19.6cM over 4 segments. With my AncestryDNA v2 test (CS3670291) it's even crazier: 61.4cM over 17 segments.

I'd like to claim cousinship, but it's improbable that any of those segment "matches" are valid.

Perhaps I should look at it another way: if I think I have a genealogical match with someone, but after reducing the segment size down to 3 cM we have no matching segments, is that good evidence that we are not related?

Not at all. Using autosomal DNA as a form of negative evidence is really only possible among close relationships. The general agreement is that 1C1R is the most distant we can go while assuming that the individual will show some amount of matching DNA with us (Henn, et al., 2012; Williams et al., 2020). The Williams Lab at Cornell indicated that even at 2C there is a fraction of a chance that two cousins will share no detectable DNA.

At the level of 3C, the probability is not 100% that any two people will share autosomal DNA (Henn: 89.7%; Williams: 91.8%). By the time we get to 4C, there is less than a 50% chance that any two cousins will share any autosomal DNA (Henn: 45.9%; Williams: 48.5%). To continue the precipitous drop, at 5C the numbers from the respective sources are 14.9% and 15.9%. Of course, half relationships have to be considered as well; for example: h3C, 71.1%; h4C, 28.2%; and h5C, 8.3%. All probabilities assume no recent pedigree collapse in the lines.

When we consider that any two full siblings will, overall, share somewhere between 38% and 61% per 23andMe (the median is close to spot-on 50%, but in a bit of a quirk in the centiMorgan calculations, most companies will report siblings as showing approximately 37.5% sharing), we quickly begin to see how the concept of autosomal triangulation to distant cousins breaks down in its logic. The parents pass along 50% of their DNA, 23 haploid chromosomes from each of them. As a broad, theoretical number, we can do a little quick math to estimate how much of a parent's genome gets passed along to at least one child: 1 - 0.5^n, where n is the total number of children. With four children, over 6% of the parent's genome is still unaccounted for. We'd need six children to get to 98%. Basically, this is the reason that, by 10 generations back (or 8g-grandparents), we have less than a 50% chance of carrying any of a specific ancestor's DNA (Coop, 2013).
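That back-of-the-envelope transmission math can be written out directly; this treats each child's inheritance as an independent 50% draw, a simplification that ignores linkage but reproduces the figures above:

```python
def fraction_transmitted(n_children: int) -> float:
    """Expected fraction of a parent's autosomal genome inherited by at
    least one of n children, assuming each child independently receives
    ~50% (a simplification that ignores linkage)."""
    return 1 - 0.5 ** n_children

print(f"{fraction_transmitted(4):.2%}")  # prints 93.75% -> over 6% unaccounted for
print(f"{fraction_transmitted(6):.2%}")  # prints 98.44%
```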

For atDNA triangulation to be a valid methodology, we need to have three independent lines of meiosis events from the MRCA. Several functions during meiosis affect the outcome, so running the numbers is no simple matter. Since 2020 I've tried to interest a handful of well-known bioinformaticians in taking on a detailed simulation to better represent the probabilities, but if we approach it as simple sets of random, independent events--the way we would rolling multiple dice in order to come up with the same values on each die--we would, as an example, be looking at a probability of about 0.075 that any two 5th cousins would share a measurable amount of the same autosomal DNA that came from the same ancestor. We'd simply multiply 0.075 by itself to see the probability for three 5th cousins to meet the same criteria: 0.005625. Even if the math ends up being off by 100% or more, it calls into question the many claims we see of triangulation groups containing a dozen or more test-takers originating from, say, a 5g-grandparent.
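The dice-style multiplication in that example, using the author's rough 0.075 estimate:

```python
p_pair = 0.075             # assumed chance two 5th cousins share measurable DNA from the same ancestor
p_three_way = p_pair ** 2  # a third independent line of descent: multiply once more
print(f"{p_three_way:.6f}")  # prints 0.005625
```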

From a practical standpoint, given the common tests we use today for genealogy, uniparental DNA--mtDNA and yDNA--can, with varying degrees of accuracy, be a useful form of negative evidence. This is less true for mtDNA than yDNA (see Bettinger, 2018), and Y-STR tests with few markers examined are not wholly reliable, either. Investigative genealogists regularly use autosomal DNA as negative evidence, but typically do so only out to the level of 1st cousins.

DNA as a form of positive evidence is tricky to analyze. It sometimes can be even trickier when attempting to use it as negative evidence.

Excellent answer as usual, Ed.

The problem I have with a single matching-segment comparison is that, looked at in isolation, it's pretty useless for smaller segments, for the reasons you and Andrew have stated.

However, what I'm missing from both answers is that there are ways to improve the value of matching segments (note the plural).

One such way is phasing. A phased segment can't jump between the paternal and maternal alleles. Even in cases where no parent has tested, there are nowadays excellent algorithms to achieve pseudo-phasing (see Ancestry.com).

The other part is looking at overlapping segments and performing DNA triangulation (and yes, I know what you're already shouting out now, Ed). If I have, let's say, 50 people who all share an overlapping DNA segment, there will be 1,225 pairwise comparisons between them to perform as part of DNA segment triangulation.

Now, if 96.8% of these 1,225 comparisons (meaning 1,186) are triangulating, it's a whole different situation versus not knowing whether an 8 cM segment is IBC or IBD.
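The comparison count is the standard handshake formula, n(n-1)/2:

```python
from math import comb

n_kits = 50
pairs = comb(n_kits, 2)  # 50 * 49 / 2 one-to-one comparisons
triangulating = 1186
print(pairs, f"{triangulating / pairs:.1%}")  # prints: 1225 96.8%
```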

Lastly, on the value of lowering segment thresholds: Professor Itsik Pe'er has given an excellent presentation on "The genome of the Netherlands" in which he shows how "old" segments of 1cM to 9cM (in steps of 1cM) are.

I personally still can't decide whether 8 or 7 cM is the better threshold to reflect the genealogical timeframe when it comes to my mostly German ancestors. Church books go back to the early 1600s and thus include people born in the late 1500s.

If one switches, however, to my wife's Singaporean and South-East Asian ancestors, the written sources quickly vanish once we go back past 1900: only 6 of her 8 great-grandparents are known by name, and for only 2 of them do we have estimated dates, with no exact birth/death dates.

So the value of the threshold one wants to use always depends on the availability of written information to put names, dates, and locations to the common ancestors.
Andreas, how do I perform triangulation on GEDmatch? There is a Triangulation tool on GEDmatch, but it only searches for triangulation amongst the best matches, so it's no good for triangulating these small segments. What I want to do is triangulate between a set of kits that I choose, down to 3 cM. Is that possible?
I have commented on the benefit (or rather the lack of a benefit) of lowering the centiMorgan threshold already. The need for phased data was also mentioned in my comment.

So if you're still interested in identifying a common ancestor who has been dead for more than 1,000 years, has no name, no location, and was born within a range of several hundred years, then you need to perform 1-to-1 comparisons for all possible combinations between these people.

Thanks for the best answer star, Ken. Andreas, my friend, I've been trying to get back here and reply. It'll take some carefully spent time, and I just haven't had more than a few minutes here and there recently to give. Wrapping up a deadline on Wednesday should give me a while to breathe.

Andreas, that is not what I said at all! What I find interesting is that in this question https://www.wikitree.com/g2g/1514784/is-the-relationship-close-enough-for-the-dna-to-prove-match?show=1514784#q1514784 I'm being told how unlikely it is that a segment could match someone as far back as 12 generations, whereas in the replies to this question I'm being told the opposite: that even if these segments are valid, they are probably from an ancestor much further away!

There is a lot of useful information here in the answers. I didn't understand the importance of phasing before, and it does seem that using GEDmatch to compare kits from different companies is inherently unreliable.

But for me genealogy is a puzzle game with no 100% answers. So whilst a few shared segments are not enough to confirm a match, they may lead to finding tests from other users, which can in turn reveal corresponding segments in similar regions, larger shared segments from other matches, or researched family trees that back up the hypothesis. Now, Blaine says that if you are finding larger segments then they are no longer small segments, but the point is that small segments from people with a known relationship can be used to find people with larger segments but no known relationship. Without the smaller segment match we would not have considered the larger segment match if the person has no documented tree relationship. But once we have an inkling that they may be related, we can start researching the tree to see if there is a matching shared ancestor.

Whilst I totally get that I probably do have shared ancestors with other English people with a deep history of English ancestry going back a few thousand years, I would argue that the chance of being a blood relation of some random person going back 400 years is very low. So if I share some DNA with someone, and I have paper records that indicate a genealogical relationship with them, it seems more likely that we are related via that shared ancestor than via some other unknown connection, or not related at all.

My work on AncestryDNA matches backs this up. Ancestry may give me a common ancestor match for one person with only 8 cM, but if that person has shared matches I can usually identify most of the other people, even if they have very little tree; and when I do identify them, the match correctly fits with the original match, i.e. they share the expected common ancestor.

Hi Paul,

comments on your comment:

I'm being told how unlikely it is that a segment could match someone as far back as 12 generations, whereas in the replies to this question I'm being told the opposite: that even if these segments are valid, they are probably from an ancestor much further away!

All our DNA comes from our ancestors, and there are very small segments (microhaplotypes) that go very far back in time. So I stand by my comment, and as I wrote, the presentation from Professor Itsik Pe'er, "Identity by descent in medical and population genomics", is highly recommended; it shows IBD segments down to 1 centiMorgan in the latter part of the video.

On your paragraph about using small segments to identify people with larger segments:

Use DNA triangulation on all (!) segments down to the recommended size (either 7 or 8 cM). That will keep you pretty busy for a long time. I have users who have identified over 250 triangulated groups with my app, with tens of thousands of comparisons (if not over 100k) that need to be done between all members of these TGs. IMO there's no need to go further down in 99.9% of these cases. Those that come back as not matching will most likely match at < 7 or 8 cM, whatever your threshold is, but they usually match almost all other DNA cousins in the same TG. Again, DNA segment triangulation has a lot of chances of going wrong, so I know from experience (having written the app over the last 6+ years) that a lot of rules and exceptions are needed to avoid making mistakes.

On your comment about rather being related through an identified shared ancestor:

It's important that you don't search for a DNA connection just because you have a genealogical connection. It should always be the other way round IMO.

Secondly, to those who say "but the connection might be further back and on a different branch": it's very important to collect evidence for your assumption that the MRCA found is the correct path. That includes building out your and the DNA cousin's family tree as wide and deep as possible. It also means that you must (!) find more DNA cousins who themselves have more MRCAs that confirm the branch and even extend it another 1-2 generations further back (see Jim Bartlett's "Walk the Ancestor Back" method).

What is your app?

The Your DNA family app

Interesting, I will definitely give that a go as soon as I can.
+10 votes

The answer is not at all.

We know that there is a high rate of false matches, increasing as the segment gets smaller. Blaine Bettinger has estimated that 60% of his 6-7 cM matches do not match his parents and Debbie Kennett found 54% were false.

You've made two comparisons with the same person, and only 3 segments are repeated. The others are shown to be false positives, illustrating the high rate with small segments.

A paper by Ralph & Coop used a more rigorous matching process than the genetic genealogy sites with many fewer false positives. They found two randomly selected Europeans share segments 74% of the time, and within smaller regions this rate is increased. They state that "someone from the United Kingdom shares on average about one IBD block with someone else from the United Kingdom" (caption to Figure S3).

So if you share 1 to 3 valid segments with Simon, that is what we expect for two random people from the same country, because everyone in the country shares thousands of common ancestors from 1,000 years ago.

by Andrew Millard G2G6 Pilot (123k points)
Right, but I share 11 on one test and 8 on another, so that is significantly more than one to three segments. Doesn't the higher number of small segments indicate a likelihood of a match?

I don't really understand why I match different segments on the two different tests.

Does it work this way: for any chromosome, each testing company only checks a sample of positions, so the positions checked by different companies overlap but are not exactly the same? So in this example, because both tests match a similar area of chromosomes 12, 15, and 21, that is likely to be a valid match, because it means that in those regions the majority of the sample points in my test match the majority of the sample points in the other two tests. But the other matching segments are likely to be invalid, because my sample points match only one of the two tests, and as the two tests are for the same person, it means the overall number of matching sample points is lower?

A true segment should replicate between tests, so you have at most 3 here. That is consistent with what we expect from false-positive rates. Your segments average about 4 cM. Before FTDNA revamped their matching criteria, they did a study, summarised by Leah Larkin, showing that 81% of 4 cM segments were false. That rate predicts on average 9.0/11 and 6.5/8 segments to be false; your results of 8/11 and 5/8 are very close to average.
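A quick back-of-envelope check of those expected counts (a sketch only; the ~81% false rate is the study's estimate for ~4 cM segments, and real segments are not truly independent):

```python
# Expected false positives among n small segments, assuming each is
# independently false with probability ~0.81 (the FTDNA study figure).
false_rate = 0.81

for total in (11, 8):
    expected_false = false_rate * total
    print(f"{total} segments -> ~{expected_false:.1f} false, "
          f"~{total - expected_false:.1f} plausibly true")
```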

False matches usually arise because a SNP that would break the run of matching SNPs has not been tested. As the different tests have tested different (but overlapping) sets of SNPs, a false-positive segment in one comparison can be broken by a SNP observed in the other test, and therefore isn't found there.

In these two sentences only 4 words truly match.

  • The quick brown fox jumps over the lazy dog.
  • The quick blown fax bumps over one lady dog.

If I only compared the odd-numbered letters in each word, six words would appear to match (including brown/blown and fox/fax); if I compared the even-numbered letters, a different six would match (including jumps/bumps and lazy/lady). Rather like the DNA test comparisons, different subsets of letters lead to different false matches, but true matches remain in both comparisons.
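The analogy can even be run as code. This toy sketch (my own illustration) compares the two sentences word by word, first letter-for-letter and then using only alternating letters, mimicking how different chips sample different SNP subsets:

```python
# Toy model of the sentence analogy: sampling only a subset of "letters"
# (SNPs) creates extra, spurious word (segment) matches.
a = "The quick brown fox jumps over the lazy dog".lower().split()
b = "The quick blown fax bumps over one lady dog".lower().split()

def matches(words_a, words_b, step=1, offset=0):
    """Words whose sampled letters (every `step`-th, from `offset`) agree."""
    return [wa for wa, wb in zip(words_a, words_b)
            if wa[offset::step] == wb[offset::step]]

print("all letters: ", matches(a, b))        # the 4 true matches
print("odd letters: ", matches(a, b, 2, 0))  # 6 apparent matches
print("even letters:", matches(a, b, 2, 1))  # a different 6
```

Both subsampled comparisons report six "matches", but only the four words that match in full are common to both, just as true segments replicate between tests while false positives do not.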

"The quick blown fax bumps over one lady dog."

I am going to blatantly steal that, Andrew! I figure you probably won't want attribution for it...

(I've previously used "Mary had a little lamb whose fleece was white as snow," but yours is pithier and funnier.)

Okay, I get that a 4 cM match cannot be relied on, and you say 81% of 4 cM matches are likely to be false positives. But then that means 19% are likely valid, so about 1 in 5. Since I have 11, isn't it reasonable to assume that at least one of the segments is valid, and therefore that there is a genealogical connection?
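For what it's worth, the arithmetic behind this intuition looks like the sketch below, though it assumes each segment is an independent trial, which real segments are not (pile-up regions and shared chip artifacts correlate them):

```python
# If each ~4 cM segment were independently false with probability 0.81,
# the chance that at least one of 11 is genuine would be:
p_false = 0.81
n = 11
p_at_least_one_true = 1 - p_false ** n
print(f"~{p_at_least_one_true:.0%} chance at least one segment is real")  # ~90%
```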
Yes, but which one? You would have to have phased results to figure that out. It's a rabbit hole that probably isn't worth the time.
For the purposes of deciding whether there is a DNA match to this particular individual, it doesn't really matter which segment is valid, only that there is a valid segment.

Yes a genuine segment represents a genealogical connection, but if it is very small it can be a very long way back. The one study we have on the age of segments, summarised in the chart from Speed and Balding in the ISOGG wiki, suggests that more than 70% of 2-5 cM segments are from more than 20 generations ago. At that time depth you will have multiple relationships with most of the population of England, and attributing the segment to a particular relationship is very difficult unless you can "walk the segment back" through a series of increasingly distant matches on the same ancestral line.

I didn't understand the point of uploading multiple kits for the same person before, but I guess a segment that matches multiple kits is much more likely to be a valid segment, and a good way to increase confidence. Do we have any figures for that?

I have also now updated the question with his father's test as well.
A genuine segment will always match multiple kits from the same person and one of their parents. This is a necessary but not sufficient condition, as some false matches may do this as well.

Even if a segment is genuine, these small segments are usually from very distant relationships and cannot confirm a specific genealogical relationship. You and Simon (and me* and anyone else with deep English ancestry) are likely to be 20th cousins several times and 30th cousins many times over. You will never be able to find all these relationships genealogically, but they can contribute to your sharing of small segments. For that reason these segments cannot help to confirm that you are 6th cousins once removed.

* I just compared myself to you: 13 segments between 3 and 5 cM. Simon's father also shares 13 segments with me, but only two in common with one of Simon's kits and one in common with both kits. I have 11 segments with each of Simon's kits, but only 6 are in common between them. I read this as lots of false segments, plus evidence that we share very distant ancestors. Looking at his well-developed tree and comparing it to what I know but haven't yet got onto WikiTree, I am confident that Simon and I are not 5th cousins or closer.
“The quick blown fax…” Great analogy!
+2 votes
I'm glad you asked this, because I always wonder the same thing when doing comparisons on Gedmatch, and at this point I'm ready to send in a third DNA kit.

I also like setting the thresholds low to expand results, but I realize that unless there is family-tree evidence or something else to support the relationship, you could just be running into a "pile-up region", as they are called. I have a couple of them, because my maternal grandmother can trace most lines back to Mayflower and PGM ancestors, and on my Dad's paternal side everyone lived within 20 miles of each other in Finland for centuries before leaving those small towns, so I have thousands of distant cousins there. We have the luxury of church records going back centuries, so we can verify things like an eighth cousin who shows up genetically more like a third or fourth cousin in the DNA databases.

No matter what, I hope to keep digging around with the DNA. The science will no doubt get better and better very quickly. Guess this was meant to sound encouraging.
by Jonathan Pudas G2G6 Mach 1 (10.0k points)
+7 votes
Those really small segments are not reliable at all. Even 8 cM segments can be false positives.

Example: On gedmatch I match kit F999953 on 2 segments:
8.3 cM on chromosome 4
7.5 cM on chromosome 9

Plot twist: kit F999953 is an ancient DNA kit (Rise505), from someone who lived about 3300 years ago!
by Joke van Veenendaal G2G6 Pilot (100k points)
It's always possible to find matches like this, but I don't understand the relevance. If I were looking to find a link between two people, I wouldn't start with someone sharing only 16 cM; I have matches with much higher figures that I cannot identify. But in this case I already have a genealogical mapping between myself and this person. I'm only looking at the DNA to see whether it increases or decreases confidence that the genealogical relationship is correct, and I would say it increases confidence.
Unfortunately, based on the science, it doesn't increase or decrease the confidence, because it is so unreliable. You are experiencing confirmation bias: you see something that looks "right" and are finding a reason to treat it that way. However, since the probability of those small segments being valid is so low, they are much more likely to be an artifact of the process or random chance than actual inheritance.

For example, if you look at Blaine Bettinger's blog that Edison linked, there is a very recent post about short segments showing that, in tests where the inheritance was known, segments around 3 cM have a 96% chance of being false positives.
Hi Jonathan, I can't find that article; can you post a link please?

Paul, you've probably already located the Blaine Bettinger post that Jonathan mentioned, but if not, here you go: https://thegeneticgenealogist.com/2022/08/07/an-in-depth-analysis-of-the-use-of-small-segments-as-genealogical-evidence/.

Okay, that article also ran the test against 23andMe data, where the figure was 75%, and the 96% test group had a large proportion of Ashkenazi Jewish ancestry, which is known to be problematic, so I think Jonathan has been rather selective here.
+1 vote
Given the nature of the question, my response is quite simple: I suggest that you subscribe to the genealogy-dna mailing list at https://groups.io/g/genealogy-dna and post the query there.

A delay may occur before your initial message gets through to the list: as spam protection, groups.io subjects each new subscriber's first message to moderation, requiring approval by the list administrator or a moderator.

That genealogy-dna mailing list is a good medium for obtaining advice about genealogical DNA testing and usage.
by Bret Busby G2G6 (8.6k points)
+5 votes
Several people have mentioned the importance of phasing for eliminating false positive segments. If you'd like to see a specific example, I wrote a blog post

https://segmentology.org/2015/10/02/anatomy-of-an-ibs-segment/

Edited to add: the SNP density ratio column seems rather low to me, presumably because you and Simon tested on different chips. I haven't really paid much attention to that metric, so this is just an impression.
If your match is willing, it would be interesting to see what happens if he phases his data against his father's. This is a Tier 1 tool on GEDmatch; he could subscribe for just a month to create a phased kit (and enjoy the other tools as well).

For those who can't test their parents, I almost think it's worthwhile to test a child just to get phased data if you want to work with small segments.
by Ann Turner G2G6 Mach 1 (17.0k points)
0 votes
Don't disregard small segments. Small overlapping segments are used in triangulation.

===Triangulation with 1.71 cM and 2.40 cM===

[[Lynn-944 | Loretta Layman]]

lynneage at comcast dot net

[[Lynn-944 | Loretta Layman]] is a FTDNA Project Administrator. Below, Loretta uses 1.71 cM and 2.40 cM segments for triangulation.

On chromosome 9, there is a match of 9.51 cM with one Salter descendant, a match of 2.40 cM with one Carlisle descendant that falls entirely within the Salter segment, and a match of 1.71 cM with one Donaldson and one Widney descendant that match each other exactly and also fall within the Salter segment.
by Richard J G2G6 (9.5k points)

Richard, I've owed additional information to this thread all week and have just been too busy (even putting in work hours on Easter Sunday; what follows is what I did with my lunch break). But I feel I need to comment on the example you provided, because there is nothing about our direct-to-consumer microarray autosomal tests that can remotely recommend the use of segments that small with any hope of accuracy. In fact, even the notion of reporting centiMorgan values to two decimal places is rather pointless. Some testing providers do, probably because that's simply the way the decades-old cM formula presents, but also because it looks more "sciency" from a marketing perspective. Yet the centiMorgan itself is an imprecise estimate to begin with, and working with sex-averaged values, as we all do for genealogy, means that kind of definitiveness is about as close to impossible as it gets.

Since the example used was Chromosome 9, here's a real-world case using a segment on that chromosome to illustrate the impact of sex-averaged calculations. The shared segment runs from position 29,216,761 to 70,574,578 (which, to be clear, no microarray test can establish with precision; none of our start/stop points are exact), for a total length of 41,357,817 base pairs. Under the outdated GRCh37 genome map that we still use universally for our genealogy microarray tests, the centiMorgan calculations (per the Rutgers University map) are:

  • Female genome: 20.5cM segment
  • Male genome: 2.8cM segment
  • Sex-averaged value: 11.5cM segment

The differences are not minor. The female genome undergoes crossing over during meiosis at a frequency approximately 70% greater than that of males. Working with generational commingling and large segments means that sex-averaged values can be used for like-to-like comparisons, suitable for reaching back in time a few generations. But the variances are greatly magnified if we try to deal with tiny segments.

You will find no respectable literature, peer reviewed or blogged, that will indicate segments resulting from microarray tests on the order of 2cM can ever realistically be used as a form of genealogical evidence.

FTDNA, who do report cM calculations to two decimal places and who will display very small segment sizes, state in no uncertain terms what they profess the reach of autosomal DNA testing to be: "Thus, the autosomal DNA admixture for any given individual roughly comprises the DNA of all of their ancestors within five generations." The emphasis on the last three words is mine. Five generations equals a set of shared 3g-grandparents, or contemporary 4th cousins.

Part of the reason for that--ignoring the biology for now--is simple numbers. Approximately 8% of the human genome is inaccessible to our microarray tests, which equates to about 248 million base pairs, leaving us with roughly 2.852 billion that might be accessible.

Our microarray tests examine, on average, about 650,000 SNPs. Current versions of the microarrays used by 23andMe, FTDNA, and MyHeritage have over 18% of those SNPs targeted expressly for clinical research purposes. Some of those will still be relevant to genealogy--including SNPs in genes that affect phenotype like eye/hair color--so a conservative estimate is that 10% of the 650K SNPs tested are not useful for genealogy, leaving us with about 585K loci tested.

The typical microarray test will have up to 1% no-calls (loci where the synthetic probe wasn't able to bind with the DNA in the prepared solution); a more likely median is about 0.6% no-calls. That places us at roughly 581K SNPs.

This means that we're looking at a maximum average of one marker out of every 4,900 base pairs, and a cumulative total of approximately 0.02% of testable base pairs, and only about 0.087% of the SNPs cataloged in the NIH's dbSNP database.

There is no direct correlation across chromosomes, because the centiMorgan isn't a physical measurement, only a fixed estimate of recombination based on a reference model of a single genome, but a figure often used for rough estimates is 1 million base pairs per centiMorgan. Using that, the best-case averaged scenario is that our microarray tests contain data on about 204 markers per centiMorgan.

However, in the relatively short history of direct-to-consumer autosomal testing, not only different manufacturers but also different iterations of the same version of a specific chip have had different SNPs that they targeted. In the worst comparative instance, only 17% of the same markers were examined between different tests, with the average overlap of the most popular tests about 20% (Lu, et al., 2021).

If we can compare only 20% of the SNPs between two given tests, then we're looking not at 581K SNPs but at about 116,200. That moves the one-to-one comparison to one marker in every 24,544 base pairs, and the count per 1 million base pairs, i.e. per (very roughly averaged) centiMorgan, to about 41.
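The chain of estimates in the last few paragraphs can be reproduced in a few lines (all inputs are the rough figures above, so results differ slightly from the quoted numbers due to rounding):

```python
# Reproducing the rough marker-density estimates above.
genome_bp = 2_852_000_000      # testable base pairs (~92% of the genome)
snps = 650_000 * 0.90 * 0.994  # drop ~10% clinical-only, ~0.6% no-calls
print(f"usable SNPs: ~{snps:,.0f}")                                  # ~581K
print(f"one marker per ~{genome_bp / snps:,.0f} bp")                 # ~4,900
print(f"markers per 1 Mbp (~1 cM): ~{1e6 * snps / genome_bp:.0f}")   # ~204

overlap = snps * 0.20          # ~20% SNP overlap between different chips
print(f"comparable SNPs: ~{overlap:,.0f}")
print(f"one comparable marker per ~{genome_bp / overlap:,.0f} bp")
print(f"comparable markers per 1 Mbp (~1 cM): ~{1e6 * overlap / genome_bp:.0f}")
```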

To further illustrate the imprecision of centiMorgan calculations, I've written elsewhere here on G2G about a quick comparison I did using the AncestryDNA raw data for two known 2nd cousins who both tested on the same iteration of the chip, about one month apart. A database comparison of the raw data files showed that the two kits targeted the same SNPs. These raw data files were then uploaded to FTDNA, MyHeritage, and GEDmatch for comparison.

There were 11 shared segments. In no instance did any of the three companies report the same start or stop loci for any of the segments. Likewise, none of the three companies reported the same centiMorgan values for any of the segments.

The segment that was most similar among the companies was on Chr 1 where FTDNA calculated a 39.38cM [sic] segment; MyHeritage 40.2cM; and GEDmatch 40.5cM. Providing the GEDmatch inferred start/stop loci to the Rutgers University map interpolator yielded: male genome, 26.5cM; female genome, 54.6cM; sex-averaged, 40.5cM.

The proportionately most dissimilar result was on (just a coincidence) Chr 9. FTDNA calculated a 7.58cM [sic] segment; MyHeritage 24.9cM; and GEDmatch 16.3cM. Providing the GEDmatch inferred start/stop loci to the Rutgers University map interpolator: male genome, 21.7cM; female genome, 10.7cM; sex-averaged, 16.3cM.

The net message here is that we can upload the same data to different testing/reporting companies and the returned information will disagree to a degree that rejects the reasonable use of very small segments...even if we could verify physical matching of the SNPs tested and even if we had adequate SNP density to draw a conclusion about IBD matching.

Well, you could not use such small matches for triangulation on WikiTree to confirm a relationship. Our rules require overlapping matches of at least 7cM.
