In the last post I talked about the principles of data matching; now it's time to put them into practice. I will present a generic, customisable Spark pipeline for data matching, as well as a specific instance of it for matching the toy datasets from the last post. TL;DR of the last post:
To match two datasets:
- Tokenize corresponding fields in both datasets
- Group records having a token in common (think SQL join)
- Compare records within a group and choose the closest match
Why Spark
This data matching algorithm could easily be implemented in the traditional single-machine, single-threaded way using a collection of hashmaps. In fact, this is what I have done on more than one occasion, and it worked. The advantage of Spark here is built-in scalability. If your datasets get ten times bigger, just invoke Spark requesting ten times as many cores. If matching is taking too long, throw more resources at it again. In the single-threaded model, all you can do is add RAM as your data grows, while the computation takes longer and longer and there is nothing you can do about it.
As an added bonus, I discovered that the abstractions Spark forces on you - maps, joins, reduces - are actually appropriate for this problem and encourage a better design than the naive implementation.
Example data
In the spirit of TDD, let's start by creating a test case. It will consist of two RDDs that we are going to match. Spark's DataFrames would be an even more natural choice if not for the fact that they are completely fucked up.
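A minimal stand-in with the right shape - two RDDs of (id, fields) records, with ids 1, 2 on the left and 'a', 'b', 'c' on the right. The names, addresses and phone numbers below are illustrative, not the original listing:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Left dataset: (record id, fields)
left = sc.parallelize([
    (1, {'name': 'Bruce Wayne',
         'address': '1007 Mountain Drive',
         'phone': '555-123-4567'}),
    (2, {'name': 'Thomas Wayne',
         'address': 'Gotham Cemetery',
         'phone': '555-666-7777'}),
])

# Right dataset: same fields, different ids, slightly different spellings.
right = sc.parallelize([
    ('a', {'name': 'Bruce Wayne',
           'address': 'Wayne Manor',
           'phone': '(555) 123 4567'}),
    ('b', {'name': 'Thomas Waine',          # misspelled surname
           'address': 'Gotham Cemetery',
           'phone': '555 666 7777'}),
    ('c', {'name': 'Alfred Pennyworth',
           'address': '1007 Mountain Drive',
           'phone': '555-999-0000'}),
])
```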
Tokenizers
The first step in the algorithm is to tokenize the fields. After all the talk in the last post about fancy tokenizers, for our particular toy datasets we will use extremely simplistic ones:
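For example, lowercased word splitting for names and addresses, and digit extraction for phone numbers. The helpers below (tokenize_words, tokenize_phone) are a sketch of that idea:

```python
import re

def tokenize_words(value):
    """Lowercase and split on anything that isn't a letter or a digit."""
    return [t for t in re.split(r'\W+', value.lower()) if t]

def tokenize_phone(value):
    """Strip everything except digits and return them as a single token."""
    digits = re.sub(r'\D', '', value)
    return [digits] if digits else []
```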
Now we have to specify which tokenizer should be applied to which field. You don't want to use the phone tokenizer on a person's name or vice versa. Also, tokens extracted from a name shouldn't mix with tokens from an address or a phone number. On the other hand, there may be multiple fields from which you want to extract, say, phone numbers - and those tokens should mix. Here's a minimalistic syntax for specifying these things:
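One possible shape for this spec is a list of (field, tokenizer, token class) triples, where tokens only ever mix within the same token class. The names below are illustrative:

```python
left_tokenizers = [
    ('name',    tokenize_words, 'name'),
    ('address', tokenize_words, 'address'),
    ('phone',   tokenize_phone, 'phone'),
    # a hypothetical second phone field would reuse the 'phone' class,
    # so its tokens would mix with the ones above:
    # ('mobile', tokenize_phone, 'phone'),
]

# The toy datasets happen to have the same layout on both sides.
right_tokenizers = list(left_tokenizers)
```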
And here’s how they are applied:
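Roughly: flatMap every record into ((token class, token), id) pairs, so tokens from different classes can never collide. A sketch:

```python
def tokenize(records, tokenizers):
    """(record_id, fields) -> a stream of ((token_class, token), record_id)."""
    def tokens_for(record):
        record_id, fields = record
        for field, tokenizer, token_class in tokenizers:
            for token in tokenizer(fields[field]):
                yield ((token_class, token), record_id)
    return records.flatMap(tokens_for)

left_tokens = tokenize(left, left_tokenizers)
right_tokens = tokenize(right, right_tokenizers)
```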
The result is a token -> id mapping in the form of an RDD, one for each dataset:
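With the stand-in data above it looks roughly like this (ordering will differ):

```python
left_tokens.collect()
# [(('name', 'bruce'), 1), (('name', 'wayne'), 1),
#  (('address', '1007'), 1), (('address', 'mountain'), 1), (('address', 'drive'), 1),
#  (('phone', '5551234567'), 1),
#  (('name', 'thomas'), 2), (('name', 'wayne'), 2),
#  (('address', 'gotham'), 2), (('address', 'cemetery'), 2),
#  (('phone', '5556667777'), 2)]
```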
Generating candidate matches
Now comes the time to generate candidate matches. We do that by joining records that have a token in common:
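In RDD terms that is a plain join on the token, keeping the list of tokens each pair was joined on. A sketch:

```python
candidate_matches = (
    left_tokens
    .join(right_tokens)                                  # (token, (left_id, right_id))
    .map(lambda kv: ((kv[1][0], kv[1][1]), [kv[0]]))     # ((left_id, right_id), [token])
    .reduceByKey(lambda a, b: a + b))                    # all join tokens per pair
```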
Result:
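With the stand-in data, roughly (ordering will differ):

```python
candidate_matches.collect()
# [((1, 'a'), [('name', 'bruce'), ('name', 'wayne'), ('phone', '5551234567')]),
#  ((2, 'a'), [('name', 'wayne')]),
#  ((1, 'c'), [('address', '1007'), ('address', 'mountain'), ('address', 'drive')]),
#  ((2, 'b'), [('name', 'thomas'), ('address', 'gotham'), ('address', 'cemetery'),
#              ('phone', '5556667777')])]
```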
With every match we have retained the information about what it was joined on, for later use. We have 4 candidate matches here - 2 correct ones and 2 wrong ones. The spurious matches are (1, 'c') - Bruce Wayne and Alfred Pennyworth matched due to a shared address; and (2, 'a') - Bruce Wayne and Thomas Wayne matched because of the shared last name.
Joining the original records back to the matches, so they can be compared:
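Two more joins - through the left id, then the right id - attach the full records to each candidate pair. A sketch consistent with the structures above (the nesting gets fiddly, hence the comments):

```python
full_matches = (
    candidate_matches
    .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))       # key by left_id
    .join(left)                                          # attach the left record
    .map(lambda kv: (kv[1][0][0],                        # re-key by right_id
                     (kv[0], kv[1][1], kv[1][0][1])))
    .join(right)                                         # attach the right record
    .map(lambda kv: ((kv[1][0][0], kv[0]),               # ((left_id, right_id),
                     (kv[1][0][1], kv[1][1],             #  (left_rec, right_rec,
                      kv[1][0][2]))))                    #   join tokens))
```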
Finding the best match
We're almost there. Now we need to define a function to evaluate the goodness of a match: take a pair of records and say how similar they are. We will cop out of this by just using the join keys that were retained with every match. The more different types of tokens were matched, the better:
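With the (token class, token) representation, counting distinct token classes among the join keys is enough; the signature below is illustrative:

```python
def score_match(left_record, right_record, join_keys):
    # Count how many different token classes (name, address, phone) matched.
    return len(set(token_class for token_class, _ in join_keys))
```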
We also need a function that says how high a match must score to qualify:
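For the toy data, requiring agreement on at least two token classes will do; again, a sketch:

```python
def is_good_enough_match(score):
    # At least two different token classes have to agree.
    return score >= 2
```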
And now, finally, we use those functions to evaluate and filter the candidate matches and return the matched dataset:
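Score every candidate, drop the ones below the threshold and keep the best-scoring match per left record. A sketch consistent with the structures above:

```python
matches = (
    full_matches
    .map(lambda kv: (kv[0][0],                            # key by left_id
                     (kv[0][1], kv[1], score_match(*kv[1]))))
    .filter(lambda kv: is_good_enough_match(kv[1][2]))
    .reduceByKey(lambda a, b: a if a[2] >= b[2] else b))  # closest match wins
```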
The result:
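With the stand-in data, only the two correct pairs survive (output abbreviated):

```python
matches.collect()
# [(1, ('a', ({'name': 'Bruce Wayne', ...}, {'name': 'Bruce Wayne', ...},
#             [('name', 'bruce'), ('name', 'wayne'), ('phone', '5551234567')]), 2)),
#  (2, ('b', ({'name': 'Thomas Wayne', ...}, {'name': 'Thomas Waine', ...},
#             [('name', 'thomas'), ('address', 'gotham'), ('address', 'cemetery'),
#              ('phone', '5556667777')]), 3))]
```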
Glorious.
Putting it all together
Now is the time to put "generic" back into the "generic data matching pipeline in Spark".
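The class below is a condensed sketch of such a pipeline: the same steps we just walked through, wrapped in overridable methods. The method names match the ones discussed below; the bodies are the illustrative versions from above:

```python
class DataMatcher(object):
    """Generic data matching pipeline. Subclasses supply the tokenizers
    and, optionally, the match scoring logic."""

    def get_left_tokenizers(self):
        raise NotImplementedError

    def get_right_tokenizers(self):
        raise NotImplementedError

    def score_match(self, left_record, right_record, join_keys):
        # Default score: number of distinct token classes that matched.
        return len(set(token_class for token_class, _ in join_keys))

    def is_good_enough_match(self, score):
        # Default threshold: at least two token classes must agree.
        return score >= 2

    def tokenize(self, records, tokenizers):
        # (record_id, fields) -> ((token_class, token), record_id)
        def tokens_for(record):
            record_id, fields = record
            for field, tokenizer, token_class in tokenizers:
                for token in tokenizer(fields[field]):
                    yield ((token_class, token), record_id)
        return records.flatMap(tokens_for)

    def match(self, left, right):
        left_tokens = self.tokenize(left, self.get_left_tokenizers())
        right_tokens = self.tokenize(right, self.get_right_tokenizers())

        # Candidate pairs: records sharing at least one token,
        # with the list of join tokens kept around for scoring.
        candidates = (
            left_tokens
            .join(right_tokens)
            .map(lambda kv: ((kv[1][0], kv[1][1]), [kv[0]]))
            .reduceByKey(lambda a, b: a + b))

        # Attach the original records to every candidate pair.
        full_matches = (
            candidates
            .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
            .join(left)
            .map(lambda kv: (kv[1][0][0],
                             (kv[0], kv[1][1], kv[1][0][1])))
            .join(right)
            .map(lambda kv: ((kv[1][0][0], kv[0]),
                             (kv[1][0][1], kv[1][1], kv[1][0][2]))))

        # Score, filter and keep the best match per left record.
        return (
            full_matches
            .map(lambda kv: (kv[0][0],
                             (kv[0][1], kv[1], self.score_match(*kv[1]))))
            .filter(lambda kv: self.is_good_enough_match(kv[1][2]))
            .reduceByKey(lambda a, b: a if a[2] >= b[2] else b))
```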
To use it, you have to inherit from DataMatcher and override, at a minimum, the get_left_tokenizers and get_right_tokenizers functions. You will probably want to override score_match and is_good_enough_match as well, but the defaults should work in simple cases.
Now we can match our toy datasets in a few lines of code, like this:
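Using the sketches above, something like:

```python
class ToyMatcher(DataMatcher):

    def get_left_tokenizers(self):
        return [('name',    tokenize_words, 'name'),
                ('address', tokenize_words, 'address'),
                ('phone',   tokenize_phone, 'phone')]

    def get_right_tokenizers(self):
        # The toy datasets have the same layout on both sides.
        return self.get_left_tokenizers()


matches = ToyMatcher().match(left, right)
for left_id, (right_id, _, score) in matches.collect():
    print(left_id, right_id, score)
# 1 a 2
# 2 b 3
```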
Short and sweet.
There are some optimisations that could be done to improve the speed of the pipeline; I omitted them here for clarity. More importantly, in any nontrivial use case you will want a more sophisticated evaluation function than the default one. This will be the subject of the next post.