Friday, May 27, 2016

Assessing AncestryDNA's New Matching Algorithm

Earlier this month, AncestryDNA rolled out an improved matching algorithm for their DNA matching service.  There was a lot of concern over whether this would lead to people losing access to some of their key matches.  As a result, I posted a solution that allows people to compare "unrelated" DNA matches.  This hack allows people to continue to see shared matches with people who have dropped off their match list because they do not meet the new minimum confidence threshold.

I've spent the last few weeks analyzing the changes to AncestryDNA that resulted from the new algorithm.  The most obvious change was a significant reduction in the number of matches rated at the "Extremely High" confidence level.  This was not due to a reduction in the amount of shared DNA; on the contrary, most of my top matches gained shared DNA.  Rather, it was due to Ancestry tightening the range that qualifies for the Extremely High category.  Previously, almost all of my 4th-6th cousin matches fit in this category.  Now only a small handful of matches on the first page qualify.  I am a big fan of this change because it makes this confidence level much more meaningful to the average person.  Previously, everyone from a full sibling to people as distant as 6th or 7th cousins could show up with an "Extremely High" confidence level.  Now it is reserved for the strongest matches, essentially those most likely to have an easily traceable family connection.

I compared the matches listed as "Extremely High" across a few profiles I manage to the same people's results on GEDMatch.  I found that the Extremely High category corresponds closely to matches estimated at approximately 4.0 generations or closer on GEDMatch.  Previously, matches as distant as 4.5-5.0 generations could show up in the Extremely High category on AncestryDNA.  Thus it is now possible to use the confidence level indicator on AncestryDNA as a shorthand for the more numbers-based predictions of relatedness on GEDMatch.  An average user without much understanding of the underlying shared segments and shared centimorgans can now look at this confidence level and tell at a glance whether there is likely to be a discoverable connection.

Beyond these general impressions, I thought it would be helpful to look at some actual data on the accuracy of AncestryDNA's predictions before and after the switch.  So, prior to the change, I recorded the predicted relationship and shared DNA amount for every match predicted as a 3rd cousin or closer across four DNA profiles.  These profiles had a total of 42 such matches.  I then compared the old predicted relationships and centimorgan (cM) measurements to the new ones.  Here is what I found:

Name   Old Prediction   New Prediction   Actual Relation   Old cM   New cM   % Change
M.M.   Close            Close            1/2 Sib             1881     1955       3.93
C.M.   1st              1st              1st                  926      964       4.10
J.D.   1st              1st              1st                  796      837       4.90
M.S.   2nd              2nd              1C1R                 350      365       4.11
N.T.   2nd              2nd              2C                   284      294       3.40
J.T.   2nd              2nd              2C                   266      283       6.01
S.F.   2nd              2nd              2C                   263      285       8.37
R.S.   2nd              2nd              3C1R                 211      221       4.52
G.H.   2nd              2nd              2C                   201      223       9.87
H.N.   2nd              2nd              2C                   199      212       6.53
B.M.   2nd              2nd              2C                   191      207       7.73
R.B.   2nd              3rd              3C                   172      170      -1.18
L.B.   2nd              3rd              2C1R                 170      154      -9.41
D.F.   3rd              3rd              2C1R                 168      192      12.50
B.S.   3rd              3rd              2C                   157      176      12.10
A.B.   3rd              3rd              Mult                 155      161       3.73
S.S.   3rd              3rd              2C1R                 149      156       4.49
R.W.   3rd              3rd              2C1R                 142      148       4.05
J.B.   3rd              3rd              2C1R                 138      135      -2.22
C.Z.   3rd              3rd              2C1R                 132      145       8.97
C.A.   3rd              3rd              2C1R                 129      137       6.20
K.M.   3rd              3rd              Unk                  125      130       3.85
B.W.   3rd              3rd              2C1R                 123      127       3.15
A.N.   3rd              3rd              2C1R                 123      136      10.57
J.C.   3rd              3rd              3C                   116      127       8.66
R.W.   3rd              3rd              3C                   110      112       1.79
I.C.   3rd              3rd              Unk                  108      121      10.74
R.S.   3rd              3rd              Mult                  99      100       1.00
A.R.   3rd              3rd              Mult                  98      104       6.12
J.B.   3rd              4th              2C1R                  95       69     -27.37
I.A.   3rd              3rd              3C                    92      112      21.74
K.C.   3rd              4th              Unk (4th+)            82       39     -52.44
D.K.   3rd              4th              3C                    81       76      -6.17
M.M.   3rd              4th              Mult                  76       71      -6.58
N.L.   3rd              4th              3C1R                  76       45     -40.79
K.S.   3rd              4th              3C1R                  72       69      -4.35
A.P.   3rd              4th              Unk                   72       58     -19.44
K.M.   3rd              4th              4C                    72       77       6.49
R.L.   3rd              4th              Unk                   69       66      -4.35
R.D.   3rd              4th              Unk (4th+)            65     20.7     -68.15
B.R.   3rd              4th              Unk (4th+)            64       56     -12.50
R.E.   3rd              4th              3C                    62       58      -6.45

Average change (absolute value, gains and losses both counted as positive): 10.738%
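
The 10.738% figure appears to be the mean of the absolute values in the % Change column, so gains and losses do not cancel each other out.  Here is a minimal Python sketch of that arithmetic, assuming the column values rounded to two decimals (the variable names are mine and not part of the original spreadsheet):

```python
# % Change column from the table above, rounded to two decimals, in row order.
pct_change = [
    3.93, 4.10, 4.90, 4.11, 3.40, 6.01, 8.37, 4.52, 9.87, 6.53, 7.73,
    -1.18, -9.41, 12.50, 12.10, 3.73, 4.49, 4.05, -2.22, 8.97, 6.20,
    3.85, 3.15, 10.57, 8.66, 1.79, 10.74, 1.00, 6.12, -27.37, 21.74,
    -52.44, -6.17, -6.58, -40.79, -4.35, -19.44, 6.49, -4.35, -68.15,
    -12.50, -6.45,
]

# Mean of absolute values: how far a match moved on average, in either direction.
mean_abs = sum(abs(p) for p in pct_change) / len(pct_change)

# Signed mean: gains and losses offset each other.
mean_signed = sum(pct_change) / len(pct_change)

print(f"{len(pct_change)} matches")
print(f"mean absolute change: {mean_abs:.2f}%")
print(f"mean signed change:   {mean_signed:.2f}%")
```

The absolute mean comes out at about 10.74%, in line with the reported average.  The signed mean is slightly negative because a few large losses near the bottom of the table outweigh the many small gains.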

A few caveats.  First, I simplified equivalent relationships for the "Actual Relation" column.  A first cousin twice removed is genetically equivalent to a second cousin, so in this table both are listed as second cousins.  Also, whenever the two people had multiple connections (such as being double third cousins, or both third and fourth cousins), I simply put "Mult."  Those without a tree and without a documented connection are "Unk."  Finally, all of these profiles come from non-Jewish, non-endogamous populations, though there is a fair mix of origins (colonial American, plus recent German, Irish, and Polish immigrants).
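
The equivalence between a first cousin twice removed and a second cousin can be checked with the usual expected-sharing rule for full cousins, in which each additional generation halves the expected share.  A minimal Python sketch (the function name is mine; these are averages, and real matches vary widely around them):

```python
from fractions import Fraction

def expected_share(degree, removed=0):
    """Expected fraction of autosomal DNA shared by full cousins.

    degree:  1 for first cousins, 2 for second cousins, and so on.
    removed: number of generations "removed" (0 if same generation).
    """
    return Fraction(1, 2) ** (2 * degree + 1 + removed)

print(expected_share(1, 2))  # first cousin twice removed
print(expected_share(2, 0))  # second cousin
```

Both calls give the same fraction, 1/32 (about 3.1%), which is why the two relationships are treated as one in the table.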

As the last column indicates, every match saw a change in the amount of shared DNA.  Most of these were positive changes, and the increases were typically small when considered as a percentage of the total DNA shared.  All of the top matches gained shared DNA.  However, at the bottom of the table, most of the lower 3rd cousin matches actually lost some cM, resulting in their demotion to predicted 4th cousins.  A few of these lost huge amounts of DNA.  The two largest drops (R.D., -68%, and K.C., -52%) were known to be no closer than 5th cousins, so they are now much more in line with what my research indicated.  This table also illustrates that while relationships closer than 3rd cousin can be predicted fairly accurately, beyond that level there is significant overlap between categories.
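
To make the gain/loss tally concrete, here is a short Python sketch over the Old cM and New cM columns.  It assumes percent change is measured against the old value; the table's own column does not appear to use the same base in every row, so individual figures may differ somewhat from those shown there.  The helper name is mine.

```python
# (old cM, new cM) pairs copied from the table above, in row order.
pairs = [
    (1881, 1955), (926, 964), (796, 837), (350, 365), (284, 294), (266, 283),
    (263, 285), (211, 221), (201, 223), (199, 212), (191, 207), (172, 170),
    (170, 154), (168, 192), (157, 176), (155, 161), (149, 156), (142, 148),
    (138, 135), (132, 145), (129, 137), (125, 130), (123, 127), (123, 136),
    (116, 127), (110, 112), (108, 121), (99, 100), (98, 104), (95, 69),
    (92, 112), (82, 39), (81, 76), (76, 71), (76, 45), (72, 69), (72, 58),
    (72, 77), (69, 66), (65, 20.7), (64, 56), (62, 58),
]

def pct_change(old, new):
    """Percent change relative to the old value (this baseline is an assumption)."""
    return 100.0 * (new - old) / old

gains = sum(1 for old, new in pairs if new > old)
losses = sum(1 for old, new in pairs if new < old)

# The two matches with the steepest percentage drop.
drops = sorted(pairs, key=lambda p: pct_change(*p))[:2]

print(f"{gains} gained, {losses} lost")
for old, new in drops:
    print(f"{old} -> {new} cM: {pct_change(old, new):.1f}%")
```

On this data, 28 of the 42 matches gained shared DNA and 14 lost some, and the two steepest drops are the R.D. and K.C. rows.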

So what is the final verdict?  Did the change improve the accuracy of predictions?  After filtering out the people with multiple relationships or completely unknown relationships, I found the old algorithm's predictions were correct 67% of the time.  The new algorithm improved this number to 85%.
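
The filter-then-score procedure can be sketched in Python as follows.  The matching rule here is a placeholder of my own (a prediction counts as correct only when its cousin degree equals the degree of the documented relation), and the sample is just a handful of rows from the table, so this is not expected to reproduce the 67% and 85% figures; a looser rule would score differently.

```python
# Hypothetical rule (an assumption, not the scoring used for the figures above):
# "3rd" is correct for "3C" or "3C1R", but not for "2C1R".
DEGREE = {"1st": "1", "2nd": "2", "3rd": "3", "4th": "4"}

def is_correct(predicted, actual):
    """True when the predicted cousin degree matches the documented one."""
    return actual.startswith(DEGREE.get(predicted, "?") + "C")

def accuracy(rows):
    """Percent correct among rows with a single documented relationship."""
    # Drop multiple-relationship and unknown rows before scoring.
    scored = [(p, a) for p, a in rows if not a.startswith(("Mult", "Unk"))]
    hits = sum(is_correct(p, a) for p, a in scored)
    return 100.0 * hits / len(scored)

# A few (new prediction, actual relation) pairs copied from the table.
sample = [
    ("2nd", "2C"),    # N.T.
    ("3rd", "3C"),    # R.B.
    ("3rd", "2C1R"),  # D.F.
    ("4th", "4C"),    # K.M.
    ("3rd", "Mult"),  # A.B., filtered out
    ("4th", "Unk"),   # A.P., filtered out
]

print(f"{accuracy(sample):.0f}% correct on this sample")
```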

This is based on only a small sample of four profiles, but thus far it seems to confirm Ancestry's claims of more precise matching under the new algorithm.

