Q matching at GEDmatch. Need help please

+5 votes
389 views
How should I interpret matches of less than 7 cM and/or 700 SNPs with a lowish Q score that are still there when a phased kit is used?
in Genealogy Help by Tim Gatty G2G6 Mach 1 (10.5k points)
recategorized by Ellen Smith
Hi, Tim. Are you able to reference manual phasing with one or both parents, or is the phased kit you mention one that was created by GEDmatch via their computational phasing approach?

The "Q" score algorithm has been a bit nebulous since its introduction, and I've never seen any actual large-set data examining how well it truly works. It looks good on the surface, but there are a lot of assumptions in play. GEDmatch has noted that "the list of factors is proprietary," so we don't have a full picture of what's actually happening under the hood.

BTW, you'll typically want to leave the tool set to "random probability" and raise the "precision value" to at least 30. That's what the default precision value was when Q Matching was first introduced; GEDmatch later lowered that baseline all the way down to 7, just as they lowered thresholds for the free one-to-one matching tool.

Increase that beyond 30 if you want more stringent results. This setting has a significant impact on the way the algorithm functions (it has to do with cumulative allele value frequencies for a particular SNP in the GEDmatch database), and the resulting Q value reported will be reset based upon the number entered for precision. In other words, a precision of 30 means the resulting Q value would start with that as a 0 baseline, and any segments with a lower value will be excluded from the report.

And it's worth noting that the Q value attempts to winnow out Identical By Chance (IBC) matches only. It does nothing to take into account Identical By State (IBS) matching where a segment may be valid, but be so common within a given population that it represents a pile-up region.
If you did phase to your parents’ kits, it is also possible that your matching segment is unphased on the match’s side, that is, jumping between your match’s chromosomes rather than being a contiguous DNA segment on one chromosome that was inherited from one of their ancestral lines.

I would advise against lowering the cM threshold, as it sounds like you may have done since you mention segments less than 7cM. If anything, I would raise it, especially when you are getting low overlap SNP numbers.

Thanks for your reply. I used the GEDmatch Phased Data Generator Data Entry Form with my late father's kit and mine. I did use <7 cM because I'd heard that the Q score tool was good at finding genuine matches that were <7 cM.

What's coming across very strongly is that as a useful tool Q score is a lot more hype than substance. I can't, based on answers received, see any point in using it.

Edison, I'm not aware of computational phasing at GEDmatch. Where did you find this?

My bad, Ann: loose language.

I didn't mean a haplotype inference from population clusters a la BEAGLE, SHAPEIT2, or not-my-cousin Amy Williams's HAPI-UR.

I went shorthand to save words. Hah! Me, save words.

I guess what I should have said would, more accurately, have been: "Do you have both parents tested so that you can directly compare an identified segment to see if it exists in one of them, or are you relying on GEDmatch's tool as a surrogate genome for a parent?"

Ann not only knows all this stuff, but she authored and pioneered a lot of it for genetic genealogy. So now that I'm not saving words, by way of explanation of my error--because what I wrote wasn't accurate and didn't go into any detail--I'll stop addressing Dr. Turner and just ramble.

A lot of people, I believe, assume that the GEDmatch phasing tool creates a surrogate parental genome since running it with even one tested parent produces standalone pseudo-kits for both parents, appended with the suffix M1 or P1 for maternal and paternal.

If I remember correctly, GEDmatch adopted an algorithm by John Walden and, while even its use with only one tested parent can be magnificently useful, the computation required to create the pseudo-kit has limitations. 

The first of these is that what's created as a pseudo-kit is not a surrogate genome. We inherit only 50% of either parent's DNA. So when the phasing tool steps through the data to construct the pseudo-kits, only 50% of the parental DNA is present in us to use. That isn't immediately apparent in the tool's instructions and, to complicate matters, if you run the phasing tool and then look at the information for a created pseudo-kit, you're likely to get a false impression.

Let's take a real-world example of an instance where we had three daughters and their mother take AncestryDNA v1 tests at the same time, so we know that all were tested for the same SNPs. The father was deceased. For each daughter, the phasing tool was run using their mother's kit.

Looking at the details for the absent father's pseudo-kit, in each instance we get similar numbers:

Kit PA*****P1 was uploaded on [date]
Number of original SNPs is 632442
Usable SNPs is 632442
Usable SNPs (slim) is 536123

That sure looks like a complete test kit, doesn't it? In fact, the reported number of "Usable SNPs (slim)" is roughly as large as, or even larger than, that of any of the actual tests uploaded. So how can that be if any one daughter carries only 50% of the father's DNA?

Well, it can't. And I've never found a good explanation for why the GEDmatch data reports this way. If only we could download the raw data for those pseudo-kits and look at it.

Conversely, since one half of each daughter's DNA came from the father, doing a one-to-one comparison of any daughter to the father's pseudo-kit should give us a typical parent-child half-identical value. Yep:

Largest segment = 151.8 cM
Total Half-Match segments (HIR) 3560.3cM (99.265 Pct)
Estimated number of generations to MRCA = 1
46 shared segments found for this comparison.
617597 SNPs used for this comparison.

But... What happens if we compare the father's pseudo-kit from Daughter A to that from Daughter B:

Largest segment = 109.8 cM
Total Half-Match segments (HIR) 1700.4cM (47.408 Pct)
Estimated number of generations to MRCA = 1.5
47 shared segments found for this comparison.
616147 SNPs used for this comparison.

Ah ha! These two pseudo-kits represent the same man's DNA, but now we see what's really happened: each kit is, in fact, showing only half the amount of DNA that would be expected if we were comparing normal test data from the same person. From the matching alone it looks like a grandparental or avuncular relationship.
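The roughly 50% overlap is what a simple inheritance model predicts. The sketch below is a toy simulation, not GEDmatch's method: it assumes the father's genome is cut into independent blocks (ignoring linkage and real recombination maps) and that each daughter receives one of his two haplotypes per block at random. The names `inherited_haplotypes` and `shared_fraction` are made up for illustration.

```python
import random


def inherited_haplotypes(n_blocks, rng):
    """For each block, record which of the father's two haplotypes (0 or 1)
    a child received. Blocks are treated as independent (a simplification)."""
    return [rng.randrange(2) for _ in range(n_blocks)]


def shared_fraction(child_a, child_b):
    """Fraction of blocks where both children received the same paternal haplotype."""
    same = sum(1 for a, b in zip(child_a, child_b) if a == b)
    return same / len(child_a)


rng = random.Random(42)
daughter_a = inherited_haplotypes(10_000, rng)
daughter_b = inherited_haplotypes(10_000, rng)

# Two pseudo-kits built from different daughters can only agree where both
# daughters inherited the same paternal haplotype, which is about half the time.
print(shared_fraction(daughter_a, daughter_b))  # a value close to 0.5
```

Under those assumptions the two paternal pseudo-kits overlap on about half of the blocks, which is in line with the 47.408 Pct reported above.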

Therefore, if a cousin's segment appeared to match both the daughter and the pseudo-father, it's a pretty good bet the segment is valid. However, the absence of a match in the pseudo-father can't be used as negating evidence and, furthermore, in many instances the pseudo-father's overall match to the cousin is going to look, by the numbers, to be one generation more distant than it actually is.

Another, though less significant, issue with GEDmatch phasing is that it can't reliably determine the correct allele for all individual SNPs by its process of elimination if only one parent was tested. An example: Let's say a daughter's test showed A and G at a particular locus, and the tested mother's data also shows an A and a G at that position. We know from that the missing father contributed either an A or G for the SNP, but we can't tell which value came from the father because both are present in the mother's test. 
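The process of elimination described above can be sketched in a few lines. This is only an illustration of the logic for a single SNP, assuming genotypes are given as two-letter strings; the function name `possible_paternal_alleles` is hypothetical and this is not GEDmatch's actual code.

```python
def possible_paternal_alleles(child, mother):
    """Return the set of alleles the untested father could have contributed
    at one SNP, given the child's and the mother's genotypes (e.g. "AG")."""
    first, second = child
    candidates = set()
    if first in mother:
        # The mother could have supplied `first`, so the father supplied `second`.
        candidates.add(second)
    if second in mother:
        # The mother could have supplied `second`, so the father supplied `first`.
        candidates.add(first)
    return candidates


print(possible_paternal_alleles("AG", "AA"))  # {'G'}: resolved
print(possible_paternal_alleles("AA", "AG"))  # {'A'}: resolved
print(possible_paternal_alleles("AG", "AG"))  # both A and G remain: ambiguous
print(possible_paternal_alleles("AG", "CC"))  # set(): inconsistent data
```

A result with exactly one allele is a resolved position. Two alleles means the SNP cannot be phased from one parent alone, and an empty set suggests a misread in one of the kits.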

In the scenario above, we had three daughters and the mother tested and were able to use Blaine Bettinger's visual phasing technique to map a much better picture of the father's genome. Another tool at GEDmatch that can help with this is "My Evil Twin Phasing."

What I don't have a handle on is how well the GEDmatch phasing tool currently deals with test versions that differ significantly in the SNPs they examine. In other words, an AncestryDNA v1 test from 2015 on a parent, run against a recent 23andMe v5 GSA chip on a child where only 25% of the same SNPs were tested, should in theory be able to deduce only around 12.5% of the missing parental genome, not the 50% deduced from the same-version AncestryDNA tests of the three daughters and their mother.
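The arithmetic behind that estimate, as a back-of-the-envelope sketch: it assumes the deducible share is simply the SNP overlap between the two chips multiplied by the half of the missing parent's DNA the child carries. The function name `deducible_parent_fraction` is made up for illustration, and real results will depend on how the tool handles no-calls and ambiguous positions.

```python
def deducible_parent_fraction(snp_overlap, inherited_share=0.5):
    """Rough upper bound on the share of the missing parent's genome that
    phasing could recover: SNPs tested in both kits times the inherited half."""
    return snp_overlap * inherited_share


print(deducible_parent_fraction(1.0))   # 0.5   (same chip version for parent and child)
print(deducible_parent_fraction(0.25))  # 0.125 (only 25% of SNPs in common)
```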

That was interesting. I persuaded my father to do a test with FTDNA in 2016. He wasn't really into the DNA side of things, and so let me have access to his results. He died in July this year. My mother wasn't tested. I made a phased kit using my father's raw data.

1 Answer

+2 votes
Setting aside the technical discussion about the value and use of the Q score:

I can’t see a point in using segment matches < 7 cM unless you have a triangulated group (TG) where, among the comparisons between the DNA cousins in that TG, you have no valid matching segment > 8 cM (the threshold I’d recommend) at the locus that the TG covers.

In that case, there might be one or more misreads splitting a segment into two smaller ones whenever the DNA kit with the misread(s) is compared against every other DNA kit in that TG.

Keep in mind, the smaller the centiMorgan value, the further back in time you go with the common ancestor (assuming it’s an IBD segment). So at 7 cM it’s very easy to hit a common ancestor who is already outside the genealogical timeframe, meaning 500 years ago or earlier.
by Andreas West G2G6 Mach 7 (76.3k points)
