In this and the next post I will explain the basics of data matching and give an implementation of a bare-bones data matching pipeline in pyspark.
Data matching
You have a dataset of - let’s say - names and addresses of some group of people. You want to enrich it with information scraped from e.g. LinkedIn or Wikipedia. How do you figure out which scraped profiles match which records in your database?
Or you have two datasets of sales leads maintained by different teams and you need to merge them. You know there is some overlap between the records, but there is no unique identifier you can join the datasets on. You have to look at the records themselves to decide whether they match.
Or you have a dataset that has accumulated some duplicate records over the years. How do you deduplicate it, given that the data format is not consistent and there may be errors in the records?
All of the above are specific cases of a problem that can be described as: finding all the pairs (or groups) of records in one or more datasets that correspond to the same real-world entity.
This is what I will mean by data matching in this post.
This type of task is very common in data science, and I have done it (badly) many times before finally coming up with a generic, clean and scalable solution. There are already many commercial solutions to specific instances of this problem, and I know of at least two startups whose main product is data matching. Nevertheless, for many use cases a DIY data matching pipeline should be just as good and may be easier to build than an integration with an external service or application.
An example
The general problem of data matching will be easier to discuss with a specific example in mind. Here goes:
You work at a company selling insurance to comic book characters. You have a database of 50,000 potential clients and you have just acquired 400,000 new leads.
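The original sample tables aren’t reproduced here, but a made-up approximation could look like this - the names and typos come from the discussion below, everything else is invented purely for illustration:

```python
# Illustrative records only - not the actual data.
clients = [   # the existing database, 50,000 rows
    {"id": 1, "name": "Bruce Thomas Wayne", "phone": "+1 (735) 555-0131", "city": "Gotham"},
    {"id": 2, "name": "Thomas Wayne",       "phone": "+1 (735) 555-0102", "city": "Gotham"},
    {"id": 3, "name": "Bruce Banner",       "phone": "+1 (228) 555-0199", "city": "New York"},
]

leads = [     # the newly acquired leads, 400,000 rows
    {"id": 1001, "name": "Wayne, Burce T.",   "phone": "735.555.0131 ext 12", "city": "Ogtham"},
    {"id": 1002, "name": "B. Banner",         "phone": "2285550199",          "city": "New York"},
    {"id": 1003, "name": "Alfred Pennyworth", "phone": "+1 (735) 555-0175",   "city": "Gotham"},
]
```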
You need to reconcile the two tables - find which records (if any) from the second table correspond to records in the first. Unfortunately, the data in the two tables is formatted differently and there are typos in both (“Burce”, “Ogtham”). Nevertheless, it is easy for a human to figure out who is who just by eyeballing the two datasets: one record in each table clearly refers to the same person - Bruce Wayne - and another pair matches Bruce Banner. The remaining records - Thomas Wayne and Alfred Pennyworth - don’t have any matches.
Now, how do we do the same automatically and at scale? Comparing every record from one table to every record in the other - 50,000 × 400,000 = 20 billion comparisons - is out of the question.
The impatient programmer approach
Like any red-blooded programmer, when I see a list of things to be matched to other things, I immediately think: hashmaps. My internal dialogue goes something like this:
- Let’s make a lookup {name -> Id} using the first table and iterate over the second table
- But the names are in a different format, they won’t match
- OK, let’s normalize the names first, strip punctuation, put the words in alphabetical order
- Still won’t match - Bruce Banner is abbreviated to B. Banner in one of the tables
- Fair enough. Let’s have two separate entries in the lookup for ‘Bruce’ and ‘Banner’ both pointing to Id = 3
- But we don’t want to match just any ‘Bruce’ or any ‘Banner’ to this guy
- Yeah, we’ll have to use the remaining attributes to verify whether a match is legit. Generate potential matches by looking up every word in the ‘name’ field, and validate by checking if the other fields look similar. This should work
- What if the name is empty or misspelled, but all other fields match up perfectly? Shouldn’t this still be considered a match?
- Now you’re being difficult on purpose! FINE. Let’s have multiple lookups - name -> Id, phone -> Id, etc. A pair of records will be considered a potential match if they have at least one of those in common
This is beginning to sound unwieldy, but it’s basically the correct approach and - I strongly suspect - the only workable one. At least as long as we don’t take the hashmaps too literally. But let’s reformulate it in slightly more abstract terms before diving into the implementation.
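For concreteness, here is a bare-bones sketch of where that internal dialogue ends up - the field names and helper functions are my own, not a prescribed implementation:

```python
from collections import defaultdict

def name_word_keys(record):
    # Normalize the name: lowercase, strip punctuation, one key per word.
    name = "".join(c for c in record.get("name", "").lower()
                   if c.isalnum() or c.isspace())
    return {("name", word) for word in name.split()}

def phone_key(record):
    # Keep only the digits of the phone number.
    digits = "".join(c for c in record.get("phone", "") if c.isdigit())
    return {("phone", digits)} if digits else set()

def candidate_pairs(clients, leads):
    """A (client_id, lead_id) pair is a potential match if the two records
    share at least one key in any of the lookups."""
    lookup = defaultdict(set)
    for record in clients:
        for key in name_word_keys(record) | phone_key(record):
            lookup[key].add(record["id"])

    candidates = set()
    for record in leads:
        for key in name_word_keys(record) | phone_key(record):
            for client_id in lookup.get(key, ()):
                candidates.add((client_id, record["id"]))
    return candidates
```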
Generalising
Data matching consists of two stages:
- Generate candidate matches
- For every record, pick the best match among the candidates
In this post I will assume that we have a good way of evaluating candidate matches (step 2) and concentrate only on step 1. In fact, step 2 is crucial and usually harder than step 1, but it’s also very case-specific. More on that next time.
Generating candidates
When is a pair of records a good candidate for a match? When the records have something in common. What exactly? For example one of the fields - like the phone number or the name or the address. That would definitely suffice, but it’s too restrictive. Consider Bruce Wayne from our example: his record in the first table and his record in the second don’t have a single field in common, and yet they clearly represent the same person.
It is tempting to try to normalise phone numbers by stripping area codes and extensions, fix misspelled names, normalise the order of first, middle and last name, etc. And sometimes this may be the way to go. But in general it’s ambiguous and lossy. There will never be a normalisation function that does the right thing for every possible version of the name - think “Bruce Wayne”, “Wayne, Bruce”, “B. T. Wayne”, “Burce Thomas Wayne” and so on.
What we can do instead is extract multiple tokens (think: multiple hashes) from the name. A pair of records will be considered a candidate match if they have at least one token in common.
We can, for example, just split the name on whitespace.
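A whitespace tokenizer is a one-liner (the function name and the example name are mine):

```python
def tokenize_words(name):
    """Lowercase the name and split it on whitespace."""
    return set(name.lower().split())

tokenize_words("Bruce Thomas Wayne")  # {'bruce', 'thomas', 'wayne'}
```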
This record would then get matched with every “Bruce”, every “Thomas” and every “Wayne” in the dataset. That may or may not be a good idea, depending on how many “Bruces” there are in this specific dataset. But tokens don’t have to be whole words. We can try bigrams.
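“Bigrams” can be read in more than one way; here is one possible version that pairs up adjacent words (character n-grams would be another option):

```python
def tokenize_bigrams(name):
    """Return every pair of adjacent words in the name."""
    words = name.lower().split()
    return {" ".join(pair) for pair in zip(words, words[1:])}

tokenize_bigrams("Bruce Thomas Wayne")  # {'bruce thomas', 'thomas wayne'}
```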
Or we can try forming initials.
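One way to do that - for every pair of words in the name, abbreviate one of them to its first letter (again just a sketch):

```python
def tokenize_initials(name):
    """For every ordered pair of words, abbreviate the first to its initial."""
    words = name.lower().replace(".", "").split()
    return {f"{a[0]}. {b}" for a in words for b in words if a != b}

tokenize_initials("Bruce Wayne")     # contains 'b. wayne'
tokenize_initials("Burce T. Wayne")  # also contains 'b. wayne'
```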
If we did that, then for instance both “Bruce Wayne” and “Burce T. Wayne” would get “B. Wayne” as one of the tokens and would end up matched as a result.
If we tokenize by removing vowels, that goes a long way toward fixing typos and alternative spellings while introducing few false positives.
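A possible vowel-stripping tokenizer:

```python
import re

def tokenize_devowel(name):
    """Drop the vowels from every word - 'Bruce' and 'Burce' both become 'brc'."""
    return {re.sub(r"[aeiou]", "", word) for word in name.lower().split()}

tokenize_devowel("Bruce Wayne")  # {'brc', 'wyn'}
tokenize_devowel("Burce Wayne")  # {'brc', 'wyn'}
```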
There are also algorithms like Soundex that tokenize a word based on how it sounds, regardless of how it’s spelled. Soundex gives the same result for “Bruce”, “Broose” and “Bruise”, or for “John” and “Jon”. Very useful, given that a lot of data entry is done by marketers who talk to people (and ask for their names) over the phone.
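You don’t have to implement Soundex yourself; in Python, for example, the jellyfish library ships one:

```python
import jellyfish  # pip install jellyfish

jellyfish.soundex("Bruce")   # 'B620'
jellyfish.soundex("Broose")  # 'B620'
jellyfish.soundex("John")    # 'J500'
jellyfish.soundex("Jon")     # 'J500'
```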
Finally, nothing stops us from using all of the above at the same time.
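Assuming the tokenizer sketches above are all defined, combining them is just a set union:

```python
def tokenize_name(name):
    """Union of all the name tokenizers sketched above. In practice you may
    want to tag each token with the tokenizer that produced it, to avoid
    accidental collisions between token types."""
    return (tokenize_words(name)
            | tokenize_bigrams(name)
            | tokenize_initials(name)
            | tokenize_devowel(name))

# tokenize_name("Bruce Wayne") & tokenize_name("Burce T. Wayne")
# -> {'wayne', 'b. wayne', 'brc', 'wyn'}
```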
With this, all the different ways of spelling “Bruce Wayne” should get at least one token in common and if we’re lucky few other names will.
This was an overview of name tokenization. Other types of data will require their own tokenization rules. The choice of tokenization logic necessarily depends on the specific data type and dataset, but the general principles are:
- tokenization should try to fix ambiguities in the data. If there are multiple ways a real-world entity can be represented in the dataset, a good tokenizer will give all of them the same token
- tokens should be specific enough that records representing different real-world entities only rarely get a token in common
One non-name example: phone numbers. Since people enter phone numbers into databases in a thousand different formats, with all kinds of rubbish area codes and extensions, you shouldn’t count on raw phone numbers matching perfectly. A sensible tokenizer is one that first strips all non-digit characters from the phone number and then returns the last 8 digits.
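A sketch of such a tokenizer:

```python
import re

def tokenize_phone(phone):
    """Strip everything that isn't a digit, then keep the last 8 digits."""
    digits = re.sub(r"\D", "", phone)
    return {digits[-8:]} if len(digits) >= 8 else set()

tokenize_phone("+1 (735) 555-0131")  # {'55550131'}
tokenize_phone("735.555.0131")       # {'55550131'}
```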
Or, to guard against people tacking extensions onto the end of phone numbers, we can extract every window of 8 consecutive digits.
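And a sliding-window version (same caveat - a sketch, not the definitive implementation):

```python
import re

def tokenize_phone_windows(phone):
    """Return every window of 8 consecutive digits, to cope with
    extensions tacked onto the end of the number."""
    digits = re.sub(r"\D", "", phone)
    return {digits[i:i + 8] for i in range(len(digits) - 7)}

tokenize_phone_windows("735.555.0131 ext 12")
# {'73555501', '35555013', '55550131', '55501311', '55013112'}
# '55550131' is also produced by the extension-free "735.555.0131"
```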
This should catch any reasonable way of writing the number while still having very low likelihood of a collision.
TL;DR
To match two datasets:
- Tokenize corresponding fields in both datasets
- Group records having a token in common (think SQL join)
- Compare records within a group and choose the closest match
Coming up: how to implement this in spark.