In this post I will share some tips on the final aspect of data matching that was glossed over in parts 1 and 2 - scoring matches. This is perhaps the hardest part of the process, and it's also the one that requires the most domain knowledge, so it's hard to give general advice.
Recap
In the previous posts we started with two datasets, "left" and "right". Using tokenization and the magic of Spark, we generated for every left record a handful of right records that may correspond to it.
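For example, this record (the values are invented for illustration, but the fields follow the running example of names, phone numbers and postcodes):

```python
# a hypothetical left record
{'Id': 1,
 'name': 'Bruce Wayne',
 'phone': '+44 7700 900077',
 'postcode': 'SW1A 1AA'}
```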
got these two as candidate matches:
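Here 'a' shares the name and the phone number with the left record, while 'c' shares only the postcode (again, the values are made up):

```python
# hypothetical candidate matches from the right dataset
[{'Id': 'a',
  'name': 'Bruce Wayne',
  'phone': '+44 7700 900077',
  'postcode': 'M1 1AE'},
 {'Id': 'c',
  'name': 'John Smith',
  'phone': '+44 20 7946 0958',
  'postcode': 'SW1A 1AA'}]
```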
And now we need to decide which of them - if any - is the correct one (or ones). Last time we dodged this problem with a heuristic: "the more keys were matched, the better the candidate". In this case the record with Id 'a' was matched on both name and phone number while 'c' was matched on the postcode alone, therefore 'a' is the better match. That worked in our simple example, but in general it's not very accurate or robust. Let's try to do better.
Similarity functions
The obvious first step is to use some string comparison function to get a continuous measure of similarity for the names, rather than a binary match/no-match. Levenshtein distance will do; Jaro-Winkler is even better.
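One way to do this, assuming the jellyfish library (recent versions expose jaro_winkler_similarity; older ones call it jaro_winkler):

```python
import jellyfish

def name_sim(a, b):
    # Jaro-Winkler similarity: 1.0 for identical strings, near 0 for unrelated ones
    return jellyfish.jaro_winkler_similarity(a.lower(), b.lower())
```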
And likewise for the phone numbers - a sensible measure of similarity would be the length of the longest common substring:
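A sketch using the standard dynamic-programming algorithm, with the length normalised by the longer of the two numbers (the normalisation is my own choice here):

```python
def longest_common_substring(a, b):
    # table[i][j] = length of the longest common suffix of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    longest = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                longest = max(longest, table[i][j])
    return longest

def phone_sim(a, b):
    # divide by the longer length so the score lands in [0, 1]
    return longest_common_substring(a, b) / max(len(a), len(b), 1)
```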
This makes sense at least if the likely source of phone number discrepancies is area codes or extensions. If we’re more worried about typos than different prefixes/suffixes then Levenshtein would be the way to go.
Next we need to come up with some measure of postcode similarity - for UK postcodes something like: full match = 1, partial match (say, the same outward code) = 0.5. And the same goes for any other characteristic that can be extracted from the records in both datasets.
With all those comparison functions in place, we can create a better scorer:
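Something along these lines, reusing the similarity functions sketched above; the weights are hand-picked guesses:

```python
def postcode_sim(a, b):
    # crude UK postcode similarity: full match = 1, same outward code = 0.5
    if a == b:
        return 1.0
    if a.split()[0] == b.split()[0]:
        return 0.5
    return 0.0

def score(left, right):
    # weighted sum of per-field similarities; the weights are arbitrary
    return (2.0 * name_sim(left['name'], right['name'])
            + 1.5 * phone_sim(left['phone'], right['phone'])
            + 1.0 * postcode_sim(left['postcode'], right['postcode']))
```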
This should already work significantly better than our previous approach but it’s still an arbitrary heuristic. Let’s see if we can do better still.
Scoring as classification
Evaluation of matches is a type of classification. Every candidate match is either true or spurious, and we use the similarity scores to decide which is the case. This suggests a simple approach:
- Take a hundred or two hundred records from the left dataset together with the corresponding candidates from the right dataset.
- Hand-label every record-candidate pair as true or false.
- Calculate similarity scores for every pair.
- Train a classifier model on the labeled examples.
- Apply the model to the rest of the left-right candidate pairs, using the probabilistic output of the classifier as a continuous score that can be compared among candidates (see the sketch after this list).
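A minimal sketch of the last three steps, assuming scikit-learn and a hypothetical hand-labelled list labelled_pairs of (left_record, right_record, is_match) triples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(left, right):
    # one row of per-field similarities per candidate pair
    return [name_sim(left['name'], right['name']),
            phone_sim(left['phone'], right['phone']),
            postcode_sim(left['postcode'], right['postcode'])]

# labelled_pairs = [(left_record, right_record, is_match), ...] - hand-labelled
X = np.array([features(l, r) for l, r, _ in labelled_pairs])
y = np.array([is_match for _, _, is_match in labelled_pairs])

model = LogisticRegression().fit(X, y)

def classifier_score(left, right):
    # probability of being a true match - a continuous, comparable score
    return model.predict_proba([features(left, right)])[0, 1]
```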
It shouldn't have been a surprise to me, but it was, when I discovered that this actually works and makes a big difference. Even with just 4 features, matching accuracy on a benchmark dataset went up from 80% to over 90% just from switching from handpicked weights to weights fitted with logistic regression. Random forest did even better.
One more improvement that can take accuracy to the next level is iterative learning (better known as active learning). You train your model, apply it, and see in which situations the classifier is least confident (probability ~50%). Then you pick some of those ambiguous examples, hand-label them and add them to the training set, rinse and repeat. If everything goes right, the classifier will now have learned to crack previously uncrackable cases.
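The selection step could look like this, where X_unlabelled stands for the feature rows of the not-yet-labelled candidate pairs:

```python
# rank unlabelled pairs by how close the predicted probability is to 0.5
probs = model.predict_proba(X_unlabelled)[:, 1]
most_ambiguous = np.argsort(np.abs(probs - 0.5))[:100]
# hand-label these pairs, add them to the training set and refit
```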
This concludes my tutorial on data matching, but there is one more tip that I want to share.
Name similarity trick
Levenshtein distance, Jaro-Winkler distance etc. are great measures of edit distance but not much else. If the variation in the strings you're comparing is due to typos ("Bruce Wayne" -> "Burce Wanye") then Levenshtein is the way to go. Frequently though, the variation in names has nothing to do with typos at all - there are just multiple ways people refer to the same entity. If we're talking about companies, "Tesco" is clearly "Tesco PLC" and "Manchester United F.C." is the same as "Manchester United". Even "Nadbor Consulting Company" is very likely at least related to "Nadbor Limited", given how unique the word "Nadbor" is and how "Limited", "Company" and "Consulting" are common to the point of meaninglessness. No edit distance would ever figure that out, because it doesn't know anything about the nature of the strings it receives or about their frequency in the dataset.
A much better distance measure in the case of company names should look at the words the two names have in common, rather than the characters. It should also discount words according to their uniqueness. The word "Limited" occurs in a majority of company names, so it's pretty much useless; "Consulting" is more informative but still very common; and "Nadbor" is completely unique. Let the code speak for itself:
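A sketch of this idea; all_names stands for whatever collection of company names is at hand, and weighting each shared token by 1/sqrt(document frequency) is one plausible reading of the interpretation given below:

```python
import math
import re
from collections import Counter

def tokens(name):
    return set(re.findall(r'\w+', name.lower()))

# document frequency: the number of names each token appears in
df = Counter(tok for name in all_names for tok in tokens(name))

def name_similarity(a, b):
    # sum the weights of the tokens the two names share; rare tokens
    # ('Nadbor') count for a lot, ubiquitous ones ('Limited') for almost nothing
    return sum(1.0 / math.sqrt(df[tok]) for tok in tokens(a) & tokens(b))
```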
The above can be interpreted as the scalar product of the names in the bag-of-words representation in idf space, except instead of the logarithm usually used in idf I used a square root, because it gives more intuitively appealing scores. I have tested this on UK company names and it works great, but I suspect it would do a good job comparing many other types of sequences of tokens (not necessarily words).