My new catch-phrase every time Pip comments on my complete lack of ability to explain things with pith and piquancy: Haud yer wheesht, laddie! Dae ye nae ken that genetics is gey complex? (Though, Pip: trust but verify. All I truly know about DNA is that I know less now than I thought I did a few years ago.)
Julie: The short answer is: Not that I know of; and probably not very soon, especially since sales growth for genetic genealogy tests has been in the tank since mid-2018.
Unfortunately--you guessed it!--it'll take a bit to answer the questions with any meaningful detail.
Whole genome sequencing doesn't result in just one raw data file like our inexpensive microarray chip tests do. With my 30X sequencing last year I got--lemme check--12 distinct files, and even that set was missing at least two I sorely wish had been provided.
It's worth noting that the files' contents all depend on which version of the human genome reference map the testing company was using at the time the results were provided. For example, none of the providers of WGS tests that I'm aware of still use data mapped against GRCh37/hg19, which was released in February 2009 and has since been found to have thousands of discrepancies and inaccuracies. GRCh38/hg38 was released in December 2013. But we're still using GRCh37 for genealogy because that's what we started with, and companies wouldn't be able to cross-compare unless they all switched to a newer release simultaneously...and they'd have to either convert your previous raw data to the new reference map for you, or tell you to go fish and figure it out yourself. That might effectively kill what market remains for genealogy, because kits on old and new reference mappings really couldn't be directly compared. And my prediction is that if we do up and move to a new reference, it won't be GRCh38...which really is at the end of its own lifecycle; it will be the next major release after that.
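To give a sense of what that conversion step actually involves: every position in your raw data has to be "lifted over" to its new address on the new map, one coordinate at a time, and some positions don't survive the trip. Here's a minimal Python sketch using the pyliftover package (my pick purely for illustration; the companies would have their own pipelines):

    from pyliftover import LiftOver

    lo = LiftOver('hg19', 'hg38')   # downloads the UCSC hg19-to-hg38 chain file on first use

    # pyliftover expects 0-based positions. rs3094315, typically the first SNP
    # in a chip raw-data file, sits at 1-based 752,566 on GRCh37, so pass 752565.
    result = lo.convert_coordinate('chr1', 752565)
    print(result)   # [(new_chrom, new_0based_pos, strand, score)], or [] if the spot didn't survive the remap

Multiply that by 600,000-plus SNPs per kit, times millions of kits, and you can see why nobody is volunteering to go first.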
What GEDmatch used to accept as an upload, but stopped, is what's called a VCF file; it stands for variant call format: https://en.wikipedia.org/wiki/Variant_Call_Format.
It's a plain text file in a standardized format that's extremely useful as a standalone for medical and population genetics purposes...but not for genealogy. The issue there is that the file lists only those positions where your genome differs from the reference genome. Sounds okay at first glance, but the problem is that for genealogy we need to be able to verify that a shared segment is really a segment. The VCF only deals in where you're different, not where you're also the same. So you and your cousin might share an interesting indel (insertion/deletion) at a particular location on a chromosome, but it could involve only a few nucleotides, or even a single one. No segment to compare.
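To make that concrete, here's a made-up fragment of the data lines in a VCF (positions and values invented for illustration):

    #CHROM  POS     ID         REF  ALT  QUAL  FILTER  INFO  FORMAT  SAMPLE1
    1       752566  rs3094315  G    A    99    PASS    .     GT      1/1
    7       117559  .          CTT  C    87    PASS    .     GT      0/1

The second line is an indel like the one I just described: a two-base deletion at a single spot. Everything between those lines--all the positions where you simply match the reference--isn't in the file at all.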
That may well have been why GEDmatch axed the uploads. VCFs take up a lot more room than our microarray test results: about 1GB compared to 20MB. If a VCF is uploaded in combination with, say, three sets of results from different microarray chips and versions, the combined superkit could be very useful, because the microarray results give us a comparison baseline to go with all the differences. But a boatload of differences by itself isn't terribly helpful unless everyone has uploaded such a file.
One of the types of files I wish I'd received, but didn't, is called a gVCF, "genomic variant call format." This is an extended VCF that includes certain blocks of allele information that do match the reference genome map. GEDmatch never accepted these, though. And it's still not your complete genome: those reference-matching blocks are selected because they provide qualifying information about the differences, and not all of them will be genealogically useful.
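A made-up example of what one of those reference-matching blocks looks like in the common GATK flavor of gVCF:

    #CHROM  POS    ID  REF  ALT        QUAL  FILTER  INFO       FORMAT    SAMPLE1
    1       14000  .   T    <NON_REF>  .     .       END=14050  GT:DP:GQ  0/0:30:99

That single line says "from position 14,000 through 14,050, this sample matches the reference." That's exactly the kind of you're-the-same information a plain VCF throws away.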
As I mentioned upstream, the real raw data from a WGS comes in the form of FASTQ or BAM files. These are huge. I think 30X coverage is the functional minimum for WGS accuracy and will remain so, but we may well see 60X become price-point feasible for general consumer purchases in the near future. My two FASTQ files, one forward and one reverse read, total right at 270 gigabytes. A 60X test won't double that, but it won't be remarkably smaller than double. There's just no networking infrastructure on the horizon that will let us routinely upload 400-gigabyte files, never mind cloud server resources on the other end to store them, process them, and then run comparisons against many kits at a time.
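For anyone who hasn't peeked inside one: FASTQ is also plain text, four lines per read. Here's a single made-up record (real reads off current sequencers run about 100 to 150 bases):

    @A00123:45:FLOWCELL:1:1101:1000:2000 1:N:0:ACGT
    GATTACAGATTACAGATTACAGATTACAGATTACA
    +
    FFFFFFFF:FFFFFFFFFF,FFFFFFFFFFFFFFF

Line one is the read's ID, line two the base calls, line three a separator, and line four a per-base quality score. Now picture hundreds of millions of those records, paired up record-for-record between the forward and reverse files, and the 270 gigabytes makes sense.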
I can think of at least two scenarios where WGS for genealogy would be feasible, and I'll bet someone is working on one, both, or something I haven't thought of. But to my knowledge, there's no packaged solution yet. With some lengthy and tedious work, I could take my GRCh38 FASTQ data and eventually extract GRCh37-mapped data--based on Illumina and Thermo Fisher microarray specs, plus info about rsIDs tested by the major companies, plus my variants as identified in my VCF...plus maybe whatever else I could research about SNPs seen as indicative by population geneticists--and build a honkin' big data file formatted just like, say, the 23andMe standard, and then begin trying to upload it to GEDmatch to see how large a kit I can get accepted for tokenizing.
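Just to show the shape of that core step, here's a bare-bones Python sketch: filter a VCF down to the rsIDs a chip would have tested and write them out in 23andMe's four-column layout. Every file name here is hypothetical, and it punts entirely on liftover and on the homozygous-reference problem (a site missing from the VCF might match the reference, or might just not have been called--you'd need the gVCF to know which):

    CHIP_RSIDS_FILE = "chip_rsids.txt"    # hypothetical: one rsID per line, from chip manifests
    VCF_FILE = "my_wgs_variants.vcf"      # hypothetical: single-sample VCF from the WGS provider
    OUTPUT_FILE = "superkit_23andme_style.txt"

    def letters(ref, alt, gt):
        """Turn a VCF genotype like '0/1' into letter calls like 'GA'."""
        alleles = [ref] + alt.split(",")
        return "".join(alleles[int(i)] if i != "." else "-"
                       for i in gt.replace("|", "/").split("/"))

    with open(CHIP_RSIDS_FILE) as f:
        wanted = set(line.strip() for line in f if line.strip())

    with open(VCF_FILE) as vcf, open(OUTPUT_FILE, "w") as out:
        out.write("# rsid\tchromosome\tposition\tgenotype\n")  # 23andMe-style column header
        for line in vcf:
            if line.startswith("#"):
                continue                              # skip header lines
            fields = line.rstrip("\n").split("\t")
            chrom, pos, rsid, ref, alt = fields[:5]
            if rsid not in wanted:
                continue                              # not a chip-tested SNP
            fmt = fields[8].split(":")                # e.g. GT:DP:GQ
            sample = fields[9].split(":")
            gt = sample[fmt.index("GT")]
            out.write(f"{rsid}\t{chrom}\t{pos}\t{letters(ref, alt, gt)}\n")

Maybe a weekend's coding for the toy version; the liftover, the reference-call bookkeeping, and all the validation are where the lengthy-and-tedious part lives.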
But, yeah. Get back to me on that. By the time I can find enough hours to get something like that done, someone will already have a commercial solution for us.