More Dirty Little Secrets of DNA Testing
More than 26 million DNA samples have been collected since the new field of commercialized personal genomics was announced with a big splash in the pages of the journal Nature over thirteen years ago. The vast majority come from customers of Ancestry.com (14 million), 23andMe (9 million), and a handful of other companies that offer an analysis based on the SNP chip next-generation technology first developed by Illumina ($3.54 billion in sales in 2019). In 2018, as many people bought these personal ancestry tests and allowed companies to store and share their results as during the entire previous period of 2012 to 2017. According to industry analysts, the numbers are expected to increase, heading to the 100 million mark in the next few years.
We don’t want to disparage any of these businesses. On the contrary, we believe all approaches and results need to be weighed together to form the best opinion of ancestral origins. That includes family traditions as well as alternative methods such as short tandem repeat population matching, DNA Consultants’ specialty since 2006. Continuing in the spirit of “Dirty Little Secrets of DNA Testing,” originally published in 2009, this blog post seeks to update users on some of the vagaries of ancestry testing.
Here are ten “dirty little secrets” we think you ought to know about the industry.
Ancestry is just a business for most of the “big box” companies. In the same way that opticians make their money on frames rather than lenses, 23andMe is counting on a huge payoff from development partnerships with Big Pharma after harvesting customer health data. Tests are sold at introductory low prices that often do not even cover costs, a strategy that discourages competition and innovation. The real prize lies elsewhere. Ancestry is owned in part by data companies. All these firms plough “confidential” customer data and social media megadata back into their business to improve product and “add value,” that is, so they can charge higher service fees and market to more people. None of the big companies was founded by a genealogist, historian, geneticist or really even a scientist. They all began as business propositions. Their commitment to genetic genealogy is shown by a reliance on automation and cheap efficiency and a virtual absence of customer service.
Percentages are not accurate, though they pretend to be so down to the fourth decimal place. Aside from the fact that percentage estimates of ancestry are not scientific or logical (see the next “dirty little secret”), they are disguised as precise measures. A customer may be told, for instance, that she is “likely” 0.04 percent “Native American.” If this were an actual measure, it would mean that 4 ancestors of hers living some 200,000 years ago were “Native American” out of 10,000. For rough approximations, such a fine breakdown is not necessary, but it adds to the mystique of accuracy, thoroughness, reliability . . . and credibility.
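To make the arithmetic behind that example concrete, here is a minimal sketch in Python. The function name `ancestors_per` is our own illustration, not anything a testing company publishes; it only converts a reported percentage into a count out of a given pool of ancestors.

```python
from fractions import Fraction


def ancestors_per(percent: str, pool: int) -> Fraction:
    """Convert a reported percentage into a count out of `pool` ancestors.

    `percent` is passed as a string so that Fraction keeps it exact
    (no floating-point rounding).
    """
    return Fraction(percent) / 100 * pool


# 0.04 percent of a pool of 10,000 ancestors
print(ancestors_per("0.04", 10_000))  # 4
```

Exact fractions are used here only so that the tiny percentage is not blurred by floating-point rounding.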
Ancestries cannot be analyzed according to percentages because all populations are mixed and there are no pure or unmixed reference populations available. This principle seems obvious on the face of it, yet percentages are the way we all like to understand our “roots.” For instance, we may say, “My mother told us children we were half German and half Scottish with a bit of Cherokee.” DNA testing may find Polish instead of German and no Cherokee, and the Scottish may be an artifact of history determined by their neighbors where the family lived. All family traditions distort their heritage and favor some nationalities (known as “prestige” ancestries) while sweeping others under the carpet. The German grandfather may have come from Poland. All populations, and especially ancient ones, are mixed. Mixed upon mixed generation after generation cannot produce anything “pure.”
Categories of nationality or ethnicity seem arbitrary. We’ve all seen the ad for an ancestry test on TV with the man who found out he was more Scottish than German and traded in his lederhosen for a kilt. The makers of the primers and chips used in micro-assay sequencing instruments constantly change or improve the software and chemicals, resulting in changes in what customers are told about their results. To avoid climbing lab prices, some companies cut corners with their reagents or equipment. Everything is automated, and no one notices until customers complain. Not only is there a lot of error in this type of testing, but the rules keep changing. There are issues of both validity and reliability that are not addressed up front. In the early days of personal genomes, I was told my highest quantum of ancestry was Finnish, or Uralic, though I had no knowledge then, nor do I have any now, of any ancestor coming from Finland or Scandinavia. Detailed genealogy back to the 1600s has satisfactorily identified almost all lines, and none could be of that mystery type.
You can’t check out the book if your library doesn’t have it. On the flipside of the Finnish question, I can’t tell you how many customers come to us after being told by the big companies they have no American Indian. Our staff says, “That’s because the other companies don’t have the data.” If you don’t have the data you can’t draw conclusions. Their genomic data for American Indian comparisons is extremely limited and out of date. Unless you use STR reference populations, as we do, it is unlikely you will find a satisfying answer to American Indian admixture questions with SNP chip companies.
Twins get different results? Five different big companies (not including DNA Consultants) processed the samples of a set of identical twins and produced wildly different results, as an exposé showed recently. Is this the way genetic variation is supposed to work? Short answer: no. Yale scientists were “mystified” by the different results for the identical twins, who submitted the same chromosomal input.
Companies do not use publicly available data or standardized analysis methods and do not contribute freely to the state of knowledge. Essentially, 23andMe and the others operate in secrecy. They do not feel the need to prove anything to other authorities. They are not certified or accredited or regulated in any way. Their data is proprietary and their results are not replicable or verifiable. No consideration is given to opposite or different conclusions.
Haplotyping fallacies about American Indians and others. Big-company dogma says that to show a Native American origin you must belong to haplogroup A, B, C, D or (sometimes) X. We are finding that seaborne haplogroup diffusion was as viable as land migration in prehistory. Haplogroups H, I, J, K, T, U and V are just as fundamental to the history of Indians in Eastern North America as the A-through-D lineup.
Isolationism versus diffusionism. Mainstream genetics simply does not have a provision for ancient people to move about in appreciable numbers by ship, only to travel across land, so no haplogroup is considered to have expanded except by land and usually in a random, starlike pattern, like a spreading stain. This makes the Bering Land Bridge a continuing article of faith. Modern-day scientists also believe that oceans, rivers and other large bodies of water were barriers to migration, when they may have been promoters, as in the settlement of Polynesia. The belief that societies separated by oceans grew up and developed independently from each other is called isolationist theory. The opposite school is diffusionism, largely frowned upon and ridiculed by modern-day anthropologists and geneticists. Ever since the popular voyages and writings of Thor Heyerdahl, in academic circles “the adherents of transoceanic contacts between the ancient civilizations have become, as it were, ostracized and taboo… while isolationism has become the dominant paradigm” (Horst Friedrich, 1998).
When is a haplotype really a haplotype?
Ideally, and classically, a haplotype is defined by the full set of mutations that are unique to it, and it represents the beginning of a single germline and lineage. Such a definition allowed us to perfectly match individuals, say, Cases 24, 25 and 26 in Phase I of Cherokee DNA testing, all having a distinctive form of T1*. These individuals were completely unknown to one another before testing. Two of them claimed Melungeon ancestry; the other’s was unknown because it involved a sealed adoption. Case 26 was a distant cousin of mine with the same surname [Yates] whom I did not know before he became a customer (see Yates and Yates, 2014, p. 68). But on the hard evidence of a rare exact match between their haplotypes we could conclude without the slightest doubt that they were all descended in the strict female line from a single woman who lived hundreds or thousands of years ago, and we could build a case that in some instances, T was an American Indian, specifically Cherokee, lineage. With confusing standards, such comparison is not possible.
Such investigations can no longer be conducted except in extraordinary circumstances and with special protocols. The reasons are not only that the biggest public resource for matching haplotypes, Mitosearch.org, disappeared in 2018 amid concerns for privacy in the European Union, but also that genetics companies have moved to a different standard, the Reconstructed Sapiens Reference Sequence, or RSRS, which is based on a theoretical Mitochondrial Eve rather than an Englishman named Anderson who was H2a2a1 in 1981. Whereas the Cambridge Reference Sequence, the standard until 2012, took account of all mutations, the RSRS rounds off and trims in reporting a lineage (cardinal sins in science), giving us now only the possibility of “fuzzy matches” rather than precise equivalencies.
In 2012, when next-generation sequencing companies were launched and it was proposed that the revised Cambridge Reference Sequence (rCRS) should be replaced by a new Reconstructed Sapiens Reference Sequence (RSRS), Martin Richards and the other founders of haplotyping studies countered with severe warnings. Among their caveats, in addition to pointing out the errors of such company-sponsored references in the past, and confusion in notation schemes, was:
In principle, alignment [between a haplotype and its notational name] should not depend on any reference sequence for documentation [italics added for emphasis]. Unfortunately, alignment to a single sequence (whether ancestral or extant) would ignore the well-known fact that alignment and phylogeny estimation cannot be separated as independent and subsequent tasks. It remains to be investigated to what extent the current Next Generation Sequencing tools indeed violate this principle and hence may give suboptimal results.
This seems to say that the Next Gen analyses based on in silico analysis (commercial computer simulations) rather than in vitro data (actual research lab results) are naturally going to lead to mistakes in typing and notation. No haplotype is truly unique anymore, only a “consensus” type.
Dorene Soiret, an independent researcher, contacted Family Tree DNA about discrepancies in one of her project participant’s mitochondrial haplotype results in September 2020. Here’s what she learned from the horse’s mouth:
The results provided to the customer are, indeed, results from the RSRS. I asked when their customers’ test mitochondrial DNA RSRS test results were computed, if the missing or extra mutations are included when running that comparison – THEY ARE NOT. I asked if it meant that the people who are listed in the Exact HVR1, HVR2 and Coding Region mitochondrial DNA matches could have missing and extra mutations different from the test subject – YES, THOSE INDIVIDUALS CAN HAVE DIFFERENT MISSING AND EXTRA MUTATIONS. So we will never know whether those HVR1, HVR2 and Exact Coding Region Matches to my special participants are, in fact, true, exact matches. Here is the important part to keep in mind – the test subjects are being given a CONSENSUS haplogroup, NOT results tailored to each individual customer. For me, personally, a consensus haplogroup does not work for what I want to accomplish, alignment of lineages going back in a recent timeframe to the same woman.
The difference between companies’ “fuzzy matches,” which are intended, one imagines, to promote interest and connectivity with their customers (in other words, sales), and exact matches, which can further genetic science and help genealogical research, is profound. A “one-off mutational match” can be to a totally unrelated branch of the haplogroup on the other side of the world, whereas an exact match can mean only one thing: the two persons are both descended from a single woman who lived at a rather shallow time-depth in a specific location, say 2,000 years ago in Scandinavia. It is the difference between a genetic and a genealogical cousin.
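The distinction can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not any company’s matching algorithm: each haplotype is treated as a set of mutation labels, the labels `m1`, `m2` and `m3` are placeholders rather than real mtDNA positions, and the names `mutation_distance`, `classify_match` and `tolerance` are hypothetical.

```python
from typing import FrozenSet


def mutation_distance(a: FrozenSet[str], b: FrozenSet[str]) -> int:
    """Count mutations present in one haplotype but not in the other."""
    return len(a ^ b)


def classify_match(a: FrozenSet[str], b: FrozenSet[str], tolerance: int = 1) -> str:
    """Label a comparison as 'exact', 'fuzzy' (within tolerance) or 'none'."""
    distance = mutation_distance(a, b)
    if distance == 0:
        return "exact"
    if distance <= tolerance:
        return "fuzzy"
    return "none"


# Placeholder mutation labels, not real mtDNA positions.
case_a = frozenset({"m1", "m2", "m3"})
case_b = frozenset({"m1", "m2", "m3"})
case_c = frozenset({"m1", "m2"})

print(classify_match(case_a, case_b))  # exact
print(classify_match(case_a, case_c))  # fuzzy
```

Only the first comparison supports the conclusion of a shared maternal ancestor in the sense described above; the second is the kind of “one-off” match that a tolerance setting lets through.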
Bottom line: Do you want to be descended from an ancestor or an algorithm?
“Kangaroos among the Cherokee,” podcast episode of The Time Traveler’s Suitcase
“Basic American Indian DNA Test Explodes Old, Tired Theories,” news release (December 31, 2019)
“American Indian DNA Test Is Newest, Most Inclusive and Most Sensitive,” news release (November 19, 2019)
Dorene Soiret, “The Case for BlueSky and Parker Adkins: Rebuttal,” website (September 5, 2020).
Comparisons between Companies
| | DNA Consultants | 23andMe | Ancestry |
|---|---|---|---|
| Types of Testing | STR, Y, Mito, Specialty | Genomic SNP chip only | Genomic SNP chip only |
| Type of Results | Statistical prediction of forensic matches | Percentages of ethnicity by algorithm | Percentages of ethnicity by algorithm |
| Delivery of Results | Paper & digital | Online | Online |
| Customer Service | Unlimited, personalized | Low level, automated | Low level, automated |
| Privacy | Nothing shared, 100% confidential | Results used internally, sometimes shared & sold | Results used internally, sometimes shared |
| Method | Forensic Database | Consensus-based customer data | Consensus-based customer data |
| Populations | 520 forensic populations, incl. countries of Europe and 60+ Native American tribes | 31 regions | 22 regions |
| Testing offered since | 2003 | 2007 | 2012 (exited Y, Mito testing) |
| Extras | Books, blog, podcast, genealogy services | Health reports, kinship suggestions | Genealogy databases |
| Turnaround | 2-3 weeks | 2-3 weeks | 8 weeks |
| Cost | $99-299 (shipping free) | $99 up (plus shipping) | $99-199 (plus shipping) |
| Pros | Independently owned and operated, personalized service, comprehensiveness | Many health reports, low cost, social networking | Very large customer database, link to genealogy tools, social networking |
| Cons | Higher lab costs (in vitro, not in silico) | Owned in part by Big Pharma, analysis validity issues, questionable marketing messages | Owned in part by data companies, category analysis issues, negative customer reviews |
 H.-J. Bandelt et al., “The Case for the Continuing Use of the Revised Cambridge Reference Sequence (rCRS) and the Standardization of Notation in Human Mitochondrial DNA Studies,” Journal of Human Genetics 59 (Dec. 5, 2013), pp. 66–77.