# Missing Data Imputation With Pymc: Part 2

In the last post I presented a way to do Bayesian networks with pymc and use them to impute missing data. This time I benchmark the accuracy of this method on some artificial datasets.

### Datasets

In the previous posts I showed the imputation of boolean missing data, but the same method works for categorical features of any cardinality as well as continuous ones (except in the continues case additional prior knowledge is required to specify the likelihood). Nevertheless, I decided to test the imputers on purely boolean datasets because it makes the scores easy to interpret and the models quick to train.

To make it really easy on the Bayesian imputer, I created a few artificial datasets by the following process:

1. define a Bayesian network
2. sample variables corresponding to conditional probabilities describing the network from their priors (once)
3. fix the parameters from step 2. and sample the other variables (the observed ones) repeatedly

With data generated by the same Bayesian network that we will fit to it, we’re making it as easy on pymc as possible to get a good score. Mathematically speaking, the bayesian model is the way to do it. Anything less than optimal performance can only be due to a bug or pymc underfitting (perhaps from too few iterations).

The first dataset used is based on the famous wet sidewalk - rain - sprinkler network as seen in the wikipedia article on Bayesian networks.

The second, bigger, is based on the LUCAS network

And the biggest one is based on an example from some ML lecture notes

For each of these networks I would generate a dataframe with 10, 50, 250, 1250 or 6250 records and drop (replace with -1) a random subset of 20% or 50% of values in each column. Then I would try to fill them in with each model and score the model on accuracy. This was repeated 5 times for each network and data size and the accuracy reported is the mean of the 5 tries.

### Models

The following models were used to impute the missing records:

• most frequent - a dummy model that predicts most frequent value per dataframe column. This is the absolute baseline of imputer performance, every model should be at least as good as this.
• xgboost - a more ambitious machine learning-based baseline. This imputer simply trains an XGBoost Classifier for every column of the dataframe. The classifier is trained only on the records where the value of that column is not missing and it uses all the remaining columns to predict that one. So, if there are n columns - n classifiers are trained, each using n - 1 remaining columns as features.
• MAP fmin_powell - a model constructed the same way as the DuckImputer model from the previous post. Actually, it’s a different model for each dataset, but the principle is the same. You take the very same Bayesian network that was used to create the dataset and fit it to the dataset. Then you predict the missing values using MAP with ‘method’ parameter set to ‘fmin_powell’.
• MAP fmin - same as above, only with ‘method’ set to ‘fmin’. This one actually performed so poorly, (no better than random and worse than most frequent) and was so slow that I quickly dropped it from the benchmark
• MCMC 500, MCMC 2000, MCMC 10000 - same as the MAP models, except for the last step. Instead of finding maximum a posteriori for each variable directly using the MAP function, the variable is sampled n times from the posterior using MCMC, and the empirically most common value is used as the prediction. Three versions of this model were used - with 500, 2000 and 10000 iterations for burn-in repectively. After burn-in, 200 samples were used each time.

### Results

Rain-Sprinkler-Wet Sidewalk benchmark (20% missing values). Mean imputation accuracy from 5 runs vs data size.

Average fitting time in seconds. Beware log scale!

XGBoost comes out on top, bayesian models do poorly, apparently worse than even most frequent imputer. But variance in scores is quite big and there is not much structure in this dataset anyway, so let’s not lose hope. MAP fmin_powell is particularly bad and terribly slow on top of that, dropping it from further benchmarks.

Let’s try a wider dataset - the cancer network. This one has more structure - that the bayesian network knows up front and xgboost doesn’t - which should give bayesian models an edge.

Cancer network imputation accuracy. 20% missing values

Cancer network imputation time.

That’s more like it! MCMC wins when records are few, but deteriorates when data gets bigger. MCMC models continue to be horribly slow.

And finally, the biggest (27 features!), car insurance network.

Car insurance network imputation accuracy. 20% missing values

Car insurance network imputation time.

Qualitatively same as the cancer network case. It’s worth pointing out that in this case, the Bayesian models achieve at 50 records a level of accuracy that XGBoost doesn’t match until shown more than a thousand records! Still super slow though.

### Conclusions

What have we learned?

1. Bayesian models do relatively better when data is wide (more columns). This was expected. Wide data means bigger network, means there is more information implicit in the network structure. This is information that we hardcode into the Bayesian model, information that XGBoost doesn’t have.
2. Bayesian models do relatively better when data is short (less records). This was also expected, for the same reason. When data is short, the information contained in the network counts for a lot. With more records to train on, this advantage gets less important.
3. pymc’s MAP does poorly in terms of accuracy and is terribly slow, slower than MCMC. This one is a bit of a mystery.
4. For MCMC, longer burn-in gets better results, but takes more time. Duh.
5. MCMC model accuracy deteriorates as data gets bigger. I was surprised when I saw it, but it hindsight it’s clear that it would be the case. Bigger data means more missing values, means higher dimensionality of the space MCMC has to explore, means it would take more iterations for MCMC to reach a high likelihood configuration. This could be alleviated if we first learned the most likely values of the parameters of the network and then used those to impute the missing values one record at a time.
6. XGBoost rocks.

Overall, I count this experiment as a successful proof of concept, but of very limited usefulness in its current form. For any real world application one would have to redo it using some other technology. pymc is just not up to the task.

# Missing Data Imputation With Bayesian Networks in Pymc

This is the first of two posts about Bayesian networks, pymc and missing data. In the first post I will show how to do Bayesian networks in pymc* and how to use them to impute missing data. This part is boring and slightly horrible. In the second post I investigate how well it actually works in practice (not very well) and how it compares to a more traditional machine learning approach (poorly). Feel free to go straight to the second post, it has plots in it.

This post assumes that the reader is already familiar with both bayesianism and pymc. If you aren’t, I recommend that you check out the fantastic Bayesian Methods For Hackers.

* technically, everything in pymc is a Bayesian network, I know

### The problem

We have observed 10 animals and noted 3 things about each of them: - does it swim like a duck? - does it quack like a duck? - is it, in fact, a duck?

It is easy to notice that in this dataset an animal is a duck if and only if it both swims like a duck and quacks like a duck. So far so good.

But what if someone forgets to write down whether the duck number 10 did any quacking or whether the animal number 9 was a duck at all? Now we have missing data. Here denoted by -1

This tells us about the last animal that it is a duck, but the information about swimming and quacking is missing. Nevertheless, having established the rule

we can infer that the values of swims_like_a_duck and quacks_like_a_duck must both be 1 for this animal.

This is what we will try to do here - learn the relationship between the variables and use it to fill in the missing ones.

### The Bayesian solution

To be able to attack this problem, let’s make one simplifying assumption. Let’s assume that we know the causal structure of the problem upfront. That is - we know that swimming and quacking are independent random variables, while being a duck is a random variable that potentially depends on the other two.

This is the situation described by this Bayesian network:

This network is fully characterised by 6 parameters - prior probabilities of swimming and quacking -
$P(swims)$, $P(quacks)$
- and conditional probability of being a duck given values of the other 2 variables -
$P(duck \mid swims \land quacks)$,
$P(duck \mid \neg swims \land quacks)$
- and so on. We don’t know anything about the values of these parameters, other than they must be between $0$ and $1$. The bayesian thing to do in such situations is to model the unknown parameters as random variables of their own and give them uniform priors.

Thus, the network expands:

This is the network describing a single animal, but actually we have observations of many animals, so the full network would look more like this:

There is only one node corresponding to each of the 6 parameters, but there are as many ‘swims’ and ‘quacks’ and ‘duck’ nodes as there are records in the dataset.

Some of the variables are observed (orange), others aren’t (white), but we have specified priors for all the parent variables and the model is fully defined. This is enough to (via Bayes theorem) derive the formula for the posterior probability of every unobserved variable and the posterior distribution of every model parameter.

But instead of doing math, we will find a way to programmatically estimate all those probabilities with pymc. This way, we will have a solution that can be easily extended to arbitrarily complicated networks.

What could go wrong?

### pymc implementation

Disclaimer: this is all hacky and inefficient in ways I didn’t realise it would be when I started working on it. pymc is not the right tool for the job, if you want to do this seriously, in a production environment you should look for something else. pymc3 maybe?

I will now demonstrate how to represent our quack-swim-duck Bayesian network in pymc and how to make predictions with it. pymc was confusing the hell out of me when I first started this project. I will be painstakingly explicit at every step of this tutorial to save the reader some of this confusion. Then at the end I will show how to achieve the same result with 1/10th as many lines of code using some utilities of my invention.

Now the observed variables. pymc requires that we use masked arrays to represent missing values:

This is what a masked array with two missing values looks like:

Quacking and swimming nodes:

And now the hard part. We have to construct a Bernoulli random variable ‘duck’, whose conditional probability given its parents is equal to a different random variable for very combination of values of the parents. That was a mouthful, but all it means is that there is a conditional probability table of ‘duck’ conditioned on ‘swims’ and ‘quacks’. This is literally the first example in every textbook on probabilistic models. And yet, there is no easy way to express this relationship with pymc. We are forced to roll our own custom function.

If you’re half as confused reading this code as I was when I was first writing it, you deserve some explanations.

• ‘swims’ and ‘quacks’ are of type pymc.distributions.Bernoulli, but here we treat them like numpy arrays.

This is @pymc.deterministic’s doing. This decorator ensures that when this function is actually called it will be given swims.value and quacks.value as parameters - and these are indeed numpy arrays. Same goes for all the other parameters.

• earlier we used a pymc random variable for the p parameter of a pymc.Bernoulli but now we’re using a function - duck_probability

Again, @pymc.deterministic. When applied to a function it returns an object of type pymc.PyMCObjects.Deterministic. At this point the thing bound to the name ‘duck_probability’ is no longer a function. It’s a pymc random variable. It has a value parameter and everything.

Ok, let’s put it all together in a pymc model:

aaaand we’re done.

Not really. The network is ready, but there is still the small matter of extracting predictions out of it.

#### Making predictions with MAP

The obvious way to estimate the missing values is with a maximum a posteriori estimator. Thankfuly, pymc has just the thing - pymc.MAP. Calling .fit on a pymc.MAP object changes values of variables in place, so let’s print the values of some of our variables before and after fitting.

optimise the values:

and inspect the results:

The two False bits - in ‘swims’ and ‘quacks’ have flipped to True and the values of the conditional probabilities have moved in the right direction! This is good, but unfortunately it’s not reliable. Even in this simple example pymc’s MAP rarely gets everything right like it did this time. To some extent it depends on the optimisation method used - e.g. pymc.MAP(model).fit(method='fmin') vs pymc.MAP(model).fit(method='fmin_powell'). Despite the warning message recommending ‘fmin’, ‘fmin_powell’ gives better results. ‘fmin’ gets the (more or less) right values for continous parameters but it never seems to flip the booleans, even when it would clearly result in higher likelihood.

#### Making predictions with MCMC

The other way of getting predictions out of pymc is to use it’s main workhorse - the MCMC sampler. We will generate 200 samples from the posterior using MCMC and for each missing value we will pick the value that is most frequent among the samples. Mathematically this is still just maximum a posteriori estimation but the implementation is very different and so are the results.

This should have produced 200 samples from the posterior for each unobserved variable. To see them, we use sampler.trace.

200 samples of the 'P(swims)' paramter - as promised

200 samples of a conditional probability parameter.

swims boolean variable also has 200 samples. But:

quacks has two times 200 - because there were two missing values among quacks observations - and each is modeled as an unobserved variable.

sampler.trace('duck') produces only a KeyError - there are no missing values in duck, hence no samples.

Finally, posterior probability for the missing swims observation:

Great! According to MCMC the missing value in swims is more likely than not to be True!

(sampler.trace('swims')[:] is an array of 200 booleans, counting the number of True and False is equivalent to simply taking the mean).

And the two missing values in quacks are predicted to be False and True - respectively. As they should be.

Unlike the MAP approach, this result is reliable. As long as you give MCMC enough iterations to burn in, you will get very similar numbers every time.

#### The clean way

This was soul-crushingly tedious, I know. But it doesn’t have to be this way. I have created a few utility functions to get rid of the boilerplate - the creation of uniform priors for variables, the conditional probabilities, the trace, and so on. The utils can all be found here (along with some other stuff).

This is how to define the network using these utils:

(there are also versions of make_bernoulli and cartesian_bernoulli_child for categorical variables). And this is how to use it:

Next post: how all this compares to good old xgboost.

# Text Generation With Keras char-RNNs

I recently bought a deep learning rig to start doing all the cool stuff people do with neural networks these days. First on the list - because it seemed easiest to implement - text generation with character-based recurrent neural networks.

watercooling, pretty lights and 2 x GTX 1080 (on the right)

This topic has been widely written about by better people so if you don’t already know about char-RNNs go read them instead. Here is Andrej Karpathy’s blog post that started it all. It has an introduction to RNNs plus some extremely fun examples of texts generated with them. For an in depth explanation of LSTM (the specific type of RNN that everyone uses) I highly recommend this.

I started playing with LSTMs by copying the example from Keras, and then I kept adding to it. First - more layers, then - training with generators instead of batch - to handle datasets that don’t fit in memory. Then a bunch of scripts for getting interesting datasets, then utilities for persisting the models and so on. I ended up with a small set of command line tools for getting the data and running the experiments that I thought may be worth sharing. Here it is.

### And here are the results

A network with 3 LSTM layers 512 units each + a dense layer trained on the trained for a week on the concatenation of all java files from the hadoop repository produces stuff like this:

That’s pretty believable java if you don’t look too closely! It’s important to remember that this is a character-level model. It doesn’t just assemble previously encountered tokens together in some new order. It hallucinates everything from ground up. For example setSchedulerAppTestsBufferWithClusterMasterReconfiguration() is sadly not a real function in hadoop codebase. Although it very well could be and it wouldn’t stand out among all the other monstrous names like RefreshAuthorazationPolicyProtocolServerSideTranslatorPB. Which was exactly the point of this exercise.

Sometimes the network decides that it’s time for a new file and then it produces the Apache software licence varbatim followed by a million lines of imports:

At first I thought this was a glitch, a result of the network is getting stuck in one state. But then I checked and - no, actually this is exactly what those hadoop .java files look like, 50 lines is a completely unexceptional amount of imports. And again, most of those imports are made up.

And here’s a bunch of python code dreamt up by a network trained on scikit-learn repository.

This is much lower quality because the network was smaller and sklearn’s codebase is much smaller than that of hadoop. I’m sure there is a witty comment about the quality of code in those two repositories somewhere in there.

And here’s the result of training on the scalaz repository:

In equal measure elegant and incomprehensible. Just like real scalaz.

Enough with the github. How about we try some literature? Here’s LSTM-generated Jane Austen:

“I wish I had not been satisfied with the other thing.”

“Do you think you have not the party in the world who has been a great deal of more agreeable young ladies to be got on to Miss Tilney’s happiness to Miss Tilney. They were only to all all the rest of the same day. She was gone away in her mother’s drive, but she had not passed the rest. They were to be already ready to be a good deal of service the appearance of the house, was discouraged to be a great deal to say, “I do not think I have been able to do them in Bath?”

“Yes, very often to have a most complaint, and what shall I be to her a great deal of more advantage in the garden, and I am sure you have come at all the proper and the entire of his side for the conversation of Mr. Tilney, and he was satisfied with the door, was sure her situation was soon getting to be a decided partner to her and her father’s circumstances. They were offended to her, and they were all the expenses of the books, and was now perfectly much at Louisa the sense of the family of the compliments in the room. The arrival was to be more good.

That was ok but let’s try something more modern. And what better represents modernity than people believing that the earth is flat. I have scraped all the top level comments from top 500 youtube videos matching the query “flat earth”. Here is the comments scraper I made for this. And here is the neural network spat out after ingesting 10MB worth of those comments

That doesn’t make any sense at all. It’s so much like real youtube comments it’s uncanny.

# Data Matching Part 3: Match Scoring

In this post I will share some tips on the final aspect of data matching that was glossed over in parts 1 and 2 - scoring matches. This is maybe the hardest part of the process, but it also requires the most domain knowledge so it’s hard to give general advice.

### Recap

In the previous posts we started with two datasets “left” and “right”. Using tokenization and the magic of spark we generated for every left record a small bunch of right records that maybe correspond to it. For example this record:

got these two as candidate matches:

And now we need to decide which - if any - is(are) the correct one(s). Last time we dodged this problem by using a heuristic “the more keys were matched, the better the candidate”. In this case the record with Id 'a' was matched on both name and phone number while 'c' was matched on postcode alone, therefore 'a' is the better match. It worked in our simple example but in general it’s not very accurate or robust. Let’s try to do better.

### Similarity functions

The obvious first step is to use some string comparison function to get a continuous measure of similarity for the names rather than the binary match - no match. Levenshtein distance will do, Jaro-Winkler is even better.

and likewise for the phone numbers, a sensible measure of similarity would be the length of the longest common substring:

This makes sense at least if the likely source of phone number discrepancies is area codes or extensions. If we’re more worried about typos than different prefixes/suffixes then Levenshtein would be the way to go.

Next we need to come up with some measure of postcode similarity. E.g. full match = 1, partial match = 0.5 - for UK postcodes. And again the same for any characteristic that can be extracted from the records in both datasets.

With all those comparison functions in place, we can create a better scorer:

This should already work significantly better than our previous approach but it’s still an arbitrary heuristic. Let’s see if we can do better still.

### Scoring as classification

Evaluation of matches is a type of classification. Every candidate match is either true or spurious and we use similarity scores to decide which is the case. This dictates a simple approach:

1. Take a hundred or two of records from the left dataset together with corresponding candidates from the right dataset.
2. Hand label every record-candidate pair as true of false.
3. Calculate similarity scores for every pair.
4. Train a classifier model on the labeled examples.
5. Apply the model to the rest of the left-right candidate pairs. Use probabilistic output from the classifier to get a continuous score that can be compared among candidates.

It shouldn’t have been a surprise to me but it was when I discovered that this actually works and makes a big difference. Even with just 4 features matching accuracy went up from 80% to over 90% on a benchmark dataset just from switching from handpicked weights to weights fitted with logistic regression. Random forest did even better.

One more improvement that can take accuracy to the next level is iterative learning. You train your model, apply it and see in what situations is the classifier least confident (probability ~50%). Then you pick some of those ambiguous examples, hand-label them and add to the training set, rinse and repeat. If everything goes right, now the classifier has learned to crack previously uncrackable cases.

This concludes my tutorial on data matching but there is one more tip that I want to share.

### Name similarity trick

Levenshtein distance, Yaro-Winkler distnce etc. are great measures of edit distance but not much else. If the variation in the string you’re comparing is due to typos ("Bruce Wayne" -> "Burce Wanye") then Levenshtein is the way to go. Frequently though the variation in names has nothing to do with typos at all, there are just multiple ways people refer to the same entity. If we’re talking about companies "Tesco" is clearly "Tesco PLC" and "Manchester United F.C." is the same as "Manchester United". Even "Nadbor Consulting Company" is very likely at least related to "Nadbor Limited" given how unique the word "Nadbor" is and how "Limited", "Company" and "Consulting" are super common to the point of meaninglessness. No edit distance would ever figure that out because it doesn’t know anything about the nature of the strings it receives or about their frequency in the dataset.

A much better distance measure in the case of company names should look at the words the two names have in common, rather than the characters. It should also discount the words according to their uniqueness. The word "Limited" occurs in a majority of company names so it’s pretty much useless, "Consulting" is more important but still very common and "Nadbor" is completely unique. Let the code speak for itself:

The above can be interpreted as the scalar product of the names in the Bag of Word representation in the idf space except instead of the logarithm usually used in idf I used a square root because it gives more intuitively appealing scores. I have tested this and it works great on UK company names but I suspect it will do a good job at comparing many other types of sequences of tokens (not necessarily words).

# Data Matching Part 2: Spark Pipeline

In the last post I talked about the principles of data matching, now it’s time to put them into practice. I will present a generic, customisable Spark pipeline for data matching as well as a specific instance of it that for matching the toy datasets from the last post. TL;DR of the last post:

To match two datasets:

1. Tokenize corresponding fields in both datasets
2. Group records having a token in common (think SQL join)
3. Compare records withing a group and choose the closest match

### Why spark

This data matching algorithm could easily be implemented in the traditional single-machine, single-threaded way using a collection of hashmaps. In fact this is what I have done on more than one occasion and it worked. The advantage of spark here is built-in scalability. If your datasets get ten times bigger, just invoke spark requesting ten times as many cores. If matching is taking too long - throw some more resources at it again. In the single-threaded model all you can do is up the RAM as your data grows but the computation is taking longer and longer and there is nothing you can do about it.

As an added bonus, I discovered that the abstractions Spark forces on you - maps, joins, reduces - are actually appropriate for this problem and encourage a better design than the naive implementation.

### Example data

In the spirit of TDD, let’s start by creating a test case. It will consist of two RDDs that we are going to match. Spark’s dataframes would be even more natural choice if not for the fact that they are completely fucked up.

### Tokenizers

First step in the algorithm - tokenize the fields. After all this talk in the last post about fancy tokenizers, for our particular toy datasets we will use extremely simplistic ones:

Now we have to specify which tokenizer should be applied to which field. You don’t want to use the phone tokenizer on a person’s name or vice versa. Also, tokens extracted from name shouldn’t mix with tokens from address or phone number. On the other hand, there may be multiple fields that you want to extract e.g. phone numbers from - and these tokens should mix. Here’s minimalistic syntax for specifying these things:

And here’s how they are applied:

The result is a mapping of token -> Id in the form of an RDD. One for each dataset:

### Generating candidate matches

Now comes the time to generate candidate matches. We do that by joining records that have a token in common:

Result:

With every match we have retained the information about what it was joined on for later use. We have 4 candidate matches here - 2 correct and 2 wrong ones. The spurious matches are (1, 'c') - Bruce Wayne and Alfred Pennyworth matched due to shared address; (2, 'a') - Bruce Wayne and Thomas Wayne matched because of the shared last name.

Joining the original records back to the matches, so they can be compared:

### Finding the best match

We’re almost there. Now we need to define a function to evaluate goodness of a match. Take a pair of records and say how similar they are. We will cop out of this by just using the join keys that were retained with every match. The more different types of tokens were matched, the better:

We also need a function that will say: a match must be scored at least this high to qualify.

And now, finally we use those functions to evaluate and filter candidate matches and return the matched dataset:

The result:

Glorious.

### Putting it all together

Now is the time to put “generic” back in the “generic data matching pipeline in spark”.

To use it, you have to inherit from DataMatcher and override at a minimum the get_left_tokenizers and get_right_tokenizers functions. You will probably want to override score_match and is_good_enough_match as well, but the default should work in simple cases.

Now we can match our toy datasets in a few lines oc code, like this:

Short and sweet.

There are some optimisations that can be done to improve speed of the pipeline, I omitted them here for clarity. More importantly, in any nontrivial usecase you will want to use a more sophisticated evaluation function than the default one. This will be the subject of the next post.

# Data Matching Part 1: Algorithm

In this and the next post I will explain the basics of data matching and give an implementation of a bare-bones data matching pipeline in pyspark.

### Datamatching

You have a dataset of - let’s say - names and addresses of some group of people. You want to enrich it with information scraped from e.g. linkedin or wikipedia. How do you figure out which scraped profiles match wich records in your database?

Or you have two datasets of sales leads maintained by different teams and you need to merge them. You know that there is some overlap between the records but there is no unique identifier that you can join the datasets on. You have to look at the the records themselves to decide if they match.

Or you have a dataset that over that years has collected some duplicate records. How do you dedup it, given that the data format is not consistent and there may be errors in the records?

All of the above are specific cases of a problem that can be described as: Finding all the pairs (groups) of records in a dataset(s) that correspond to the same real-world entity.

This is what I will mean by data matching in this post.

This type of task is very common in data science, and I have done it (badly) many times before finally coming up with a generic, clean and scalable solution. There are already many commercial solutions to specific instances of this problem out there and I know of at least two startups whose main product is data matching. Nevertheless for many usecases a DIY data matching pipeline should be just as good and may be easier to build than integration with an external service or application.

### An example

The general problem of data matching will be easier to discuss with a specific example in mind. Here goes:

You work at a company selling insurance to comic book characters. You have a database of 50.000 potential clients (these are the first 3):

and you just acquired 400.000 new leads:

You need to reconcile the two tables - find which (if any) records from the second table correspond to records in the first. Unfortunately data in both tables is formatted differently and there are some typos in both (“Burce”, “Ogtham”). Nevertheless it is easy for a human being to figure out who is who just by eyeballing the two datasets. $Id = 1$ from the first table and $Id = a$ from the second clearly refer to the same person - Bruce Wayne. And $Id = 3$ matches $Id = b$ - Bruce Banner. The remaining records - Thomas Wayne and Alfred Pennyworth don’t have any matches.

Now, how to do the same automatically and at scale? Comparing every record from one table to every one in the other - $50k \times 400k = 2 \times 10^9$ comparisons - is out of the question.

### The impatient programmer approach

Like any red-blooded programmer, when I see a list of things to be matched to other things, I immediately think: hashmaps. My internal dialogue goes something like this:

• Let’s make a lookup {name -> Id} using the first table and iterate over the second table
• But the names are in a different format, they won’t match
• OK, let’s normalize the names first, strip punctuation, put the words in alphabetical order
• Still won’t match - Bruce Banner is abbreviated to B. Banner in one of the tables
• Fair enough. Let’s have two separate entries in the lookup for ‘Bruce’ and ‘Banner’ both pointing to Id = 3
• but we don’t want to match just any ‘Bruce’ or any ‘Banner’ to this guy
• Yeah, will have to use the remaining attributes to verify if a match is legit. Generate potential matches by looking up every word in the ‘name’ field, and validate by checking if the other fields look similar. This should work
• what if the name is empty or misspeled, but all other fields match up perfectly? Shouldn’t this be still considered a match?
• Now you’re being difficult on purpose! FINE. Let’s have multiple lookups - name -> Id, phone -> Id, etc. A pair of records will be considered a potential match if they have at least one of those in common

This is beginning to sound unwieldy, but it’s basically the correct approach and - I strongly suspect - the only workable approach. At least as long as we’re not taking the hashmaps too literally. But let’s reformulate it in slightly more abstract terms before diving into implementation.

### Generalising

Datamatching must consist of two stages:

1. Generate candidate matches
2. For every record, pick the best match among the candidates

In this post I will assume that we have a good way of evaluating candidate matches (step 2) and concentrate only on step 1. In fact 2 is crucial and usually harder than 1 but it’s very case-specific. More on that topic next time.

#### Generating candidates

When is a pair of records a good candidate for a match? When the records have something in common. What? For example one of the fields - like phone number or name or address. That would definitely suffice but it’s too restrictive. Consider Bruce Wayne from our example. In the first table:

And in the second table:

Not a single field in common between these two records and yet they clearly represent the same person.

It is tempting to try to normalise phone numbers by stripping area extensions, fix misspeled names, normalise order of first-, middle-, last name, etc. And sometimes this may be the way to go. But in general it’s ambiguous and lossy. There will never be a normalisation function that does the right thing for every possible version of the name:

What we can do instead is extract multiple tokens (think: multiple hashes) from the name. A pair of records will be considered a candidate match if they have at least one token in common.

We can for example just split the name on whitespaces:

and have this record be matched with every “Bruce” every “Thomas” and every “Wayne” in the dataset. This may or may not be a good idea depending on how many “Bruces” there are in this specific dataset. But tokens don’t have to be words. We can try bigrams:

Or we can try forming initials:

If we did that, then for instance both “Bruce Wayne” and “Burce T. Wayne” would get “B. Wayne” as one of the tokens and would end up matched as a result.

If we tokenize by removing vowels, that would go a long way toward fixing typos and alternative spellings while introducing few false positives:

There are also algorithms like Soundex that tokenize based on how a word sounds regardless of how it’s spelled. Soundex gives the same result for “Bruce” and “Broose” and “Bruise” or for “John” and “Jon”. Very useful given that a lot data entry is done by marketers who talk to people (and ask their names) over the phone.

Finally, nothing stops us from using all of the above at the same time:

With this, all the different ways of spelling “Bruce Wayne” should get at least one token in common and if we’re lucky few other names will.

This was an overview of name tokenization. Other types of data will require their own tokenization rules. The choice of tokenization logic necessarily depends on the specific data type and dataset but these the general principles:

• tokenization should try to fix ambiguities in the data. If there are multiple ways a real world entity can be represented in the dataset, good tokenizer would give all of them the same token
• tokens should be specific enough records representing different real world entities only rarely get a token in common

One not name-related example: phone numbers. Since people enter phone numbers in databases in one thousand different formats with all kinds of rubbish area codes and extensions you shouldn’t count on raw phone numbers matching perfectly. An example of a sensible tokenizer is one that first strips all non-digit characters from the phone number then returns the last 8 digits.

Or to guard against people putting extensions at the end of phone numbers, we can extract every consecutive 8 digits:

This should catch any reasonable way of writing the number while still having very low likelihood of a collision.

#### TL;DR

To match two datasets:

1. Tokenize corresponding fields in both datasets
2. Group records having a token in common (think SQL join)
3. Compare records withing a group and choose the closest match

Coming up: how to implement this in spark.

# Dear Recruiter

I’ve had a lot of experience with tech interviews and recruiters in the past and I’m likely to have a lot more in the future. A non-negligible fraction of that experience ranged from “mildly annoying” to “exasperating”.

Here’s a list of real examples of irritating recruiter behaviors together with guidlines on how I expect a reasonable person to act instead.

Before we start you must understand where I’m coming from. I’m not constantly looking for new jobs. But when I do, I apply en masse and then I have to deal with many recruiters at once. Each one constantly calling me in the office or at home (what is it with your obsession with phone calls? Why not email?). This in itself can get annoying but on top of that there are other more serious offenses. I am usually extremely agreeable both in person and over the phone. A blog post is my way of expressing complaints that I wouldn’t dare make in a conversation with you. It’s also a form of catharsis so forgive me if I get somewhat snarky and don’t take offense if you personally are not guilty of the sins mentioned here.

### When you contact me about a new role

I have an exciting opportunity with a global company, one of the leaders in its field. When can we arrange a call?

This tells me nothing. Everyone is either a “leader in their field” or an “exciting startup” or a “global brand”. I’m getting 5 of these a day. If I am to commit any time to it I first need to know:

• type of role (data scientist? developer? devops? junior? lead? manager?)
• ballpark salary
• skills they are looking for

Ideally, if possible, also

• what are they trying to build/do

This information is typically enough for me to decide if I’m interested or not. I will let you know if I am. I really don’t need to listen to you talk about their nice offices and what fancy university has the CTO attended. If I’m interested in these things I will look them up myself. This conversation is not going to change my mind one way or the other.

If you want to chat with me because you need to vet me before passing my CV on to the company please say so and indicate clearly what your decision is afterwards. And don’t do this bait-and-switch on me where I express interest in one position and you call me to discuss but only try to peddle another.

In short, this is how it’s going to work:

1. you email me basic details of the role (or multiple roles)
2. I email you back with my latest CV if I’m interested or with questions if I have any
3. you send my details over the company and try to arrange an interview

If you absolutely need to hear my voice and vet me, say so. If I’m interested in the role, we can get it over with once.

#### Before the interview

Let me know what is the best time to call you so that I can talk you through the interview process

Why? It’s a job interview not open heart surgery. There will be a 20 minutes phone call with HR, then an hour with some technical person then 3 hours of technical on-site interviews then a brief chat with some higher-up. Or a homework assignment, then technical phone call, then on-site, then HR. Or some other configuration. Whatever the case, you could’ve explained it in the email and save us both time. Frankly, I don’t care what the format is. All I need from you is the time and place and it is definitely possible to send those by email.

I will call you before the interview to give you a heads up

No.

Are you interviewing with any other companies? Tell me, so we can adjust the schedule so that you don’t miss this opportunity

What you mean is “we can adjust the schedule so that you don’t get the chance to interview with anyone else” (more on that later). Thank you very much. I don’t want to lie to you, so I’m just not going to tell you anything. If I actually need to speed things up because of other interviews, I will let you know. By email.

How it is going to work:

1. you give me time and place
2. I confirm that I will show up

### After the interview

Please call me after the interview to tell me how did it go …

It went well. Probably. Or maybe not. Either way it doesn’t really matter how I think it went, does it?

… what was it like …

It was like every other interview I’ve been to. First they introduced themselves and the company, then we talked about my resume, then about my motivation and finally they gave me some vaguely job-related problems to solve and I solved them. What else did you expect?

… how did you like them

I liked them fine. Or maybe I loved them. Or maybe I thought they were boring. You won’t find out by calling though, because I’m only gonna feed you some enthusiastic sounding platitudes because I want to get to the next stage in this process. There is no upside for me in telling you that I thought the interviewers were boring and dumb even if it was true. So just don’t ask. In the unlikely event that I hated the interview enough to make me want to withdraw my application, I promise to let you know. By - you guessed it - email.

How it’s going to work:

1. if you have any feedback from the interviewers or a decision, you will email it to me
2. if I want to take myself out of the process I will inform you also by email

### The final stretch

So I made it through all the rounds and I’m expecting to hear back from the company. You call me saying you have some positive initial feedback, then we have this conversation:

You: If they come back with an offer of £n are you going to accept?
Me: I don’t know. I’ll need a couple of days to think about it.
You: Why wouldn’t you accept it? What is wrong with the offer?
Me: Nothing is wrong with it. I just need some time to consider my options. Not very long, just the weekend.
You: If there is nothing wrong with the offer then you should take it. I need to get back to them and tell them that you are going to take the offer. Otherwise they will think that you are not really motivated to work with them and they will not make the offer.
Me: What are you talking about?! They know I’m motivated. But this is a serious decision and I will not make it without proper consideration.
You: They have deadlines, you know. They can’t wait forever.
Me: It’s only two days!
You: If I don’t tell them you will accept, they will keep interviewing and they can find someone else. This is a very buyoant market!

This is bullshit and you know it. This is not a heist movie and you’re not looking for a last minute replacement safe cracker. The job ad has been out for months. I think they can wait two more days.

One time this happened to me right after the interview. The recruiter didn’t even wait to hear from the company in question. He based this whole routine on my reporting that the interview went well and that we discussed money at the end. This was also a case of him putting me up for a job that paid 10% less than the one I held at the time even though I explicitly required that my new position pays 10% more. I only found out during the final interview.

Another recruiter upon learning that I’m interviewing with other companies got my final interview cancelled. My first interview was on Monday and the final one was supposed to take place a week later. On Wednesday he asked me about other interviews, then we’ve had the bullshit “you’ve got to accept” conversation. When I didn’t budge, the recruiter got the company to make me an offer that evening, skipping the final interview. The offer came with an expiration date on 12am the following day! 4 days before the planned interview. He even blurted out that the insane timing was because they were afraid they would lose me, before covering up with the deadlines and buyoant markets bs. That they have to resort to this type of tactics is all the proof I need that the market for data scientists isn’t buyoant at all. Obviously they were afraid I would accept another offer. And the only reason they were afraid was that the recruiter tipped them off.

I understand now that we’re not allies. We are not exactly oponents but you’re definitely playing for a different team.

So I am not going to make it easy for your team. I will not get caught up in the false sense of urgency you’re trying to create (one recruiter tried to get me to come to the office and sign the contract on a Sunday) … ever again. I will not fall for scare tactics. I will not reveal any information about other interviews I may be having. Here’s what will happen:

1. you will only contact me when you have an offer
2. I will take my time considering that offer - and I will tell you how long exactly. Shorter if it’s a “start immediately” fixed term contract, longer if it’s a permanent position, but I will never be bullied into committing to a contract I haven’t even seen
3. I will ignore any exploding offer deadlines so don’t even bother. The only exception is if the deadline has been established up front before any interviews

It’s worth noting that I have never experienced anything resembling the “bullshit conversation” when I dealt with a potential employer directly. When applying through recruiters it happened every time.

### Final notes

• no I will not have coffee with you. Why would you even think I’d like that?
• I don’t need you to tell me that “if you like the job you should be flexible about the money” or that “these companies are not going to wait forever, better apply now”
• in general I would appreciate if you keep your obviously-biased career advice to yourself
• any tips regarding my resume or specific interviews are very welcome
• if you can write an email to arrange a call to tell me X, you can just tell me X in the email

If these ground rules seem unreasonable to you, then save us both the trouble and don’t contact me.

Yours Exasperatedly

### P.S.

I didn’t intend to make it read like I’m bragging about all the offers I’m getting like I’m God’s gift to data science who can get any job he wants. It’s just that I’ve interviewed a lot over the years so in addition to dozens of rejections I have had several successes too. I’m only describing here the more successful examples because in case of rejection recruiters tend to mercifully leave me alone.

# Deepwalking With Companies

I have blogged about the wide usefulness of topic models and I have benchmarked word-embedding-assisted classification on Reuter’s benchmark. This time I experiment with these ideas using a real world and decent sized dataset - the graph of UK/Irish companies. I have done this during my “10% time” at DueDil (it’s like google’s “20% time”, except it exists).

### Graph of companies

There are 4 million active companies in the UK and Ireland. DueDil collects all kinds of information about them - financials, legal structures, contact info, company websites, blogs, news appearences etc. All of it is presented to our users and some of it also serves as input to machine learning tasks - like classifying companies into industries.

One very interesting dataset that remains underutilised (AFAIK by anyone, not just DueDil) is the network of connections of companies and directors.

You can tell a lot about a company just by looking at its directors. That is - if you know anything about these people. At DueDil we don’t know much more than just their identities. This would be rather useless in the context of a single company. But there are millions of companies and people who serve as their directors more often then not do it many times in their careers at different companies. Knowing that the director’s name is Jane Brown may be useless, but knowing that the director previously held similar positions at three different tech startups is highly relevant. And this is just one director out of many and one type of relationship.

More generally, one can think about companies as nodes in a graph. Two companies are connected iff there is a person who has served as a director at both of them (not necessarily at the same time). I will call this the company graph. Here’s a part of the graph containing DueDil.

DueDil is connected to Founders For Good Ltd because our CEO Damian Kimmelman is also a director at the other company.

It is intuitive that the position of a company in this graph tells us something about the company. It is however difficult to do anything with this information unless it is somehow encoded into numbers.

### Company2Vec

This is where word embeddings come in. As I mentioned previously, it is possible to apply Word2Vec to a graph to get an embedding of graph nodes as real-valued vectors in a procedure called DeepWalk. The idea is very simple:

1. Construct a bunch of random walks on the graph
2. Feed the random walks into Word2Vec

A random walk is just a sequence of nodes, where the next node is always one of the neighbours of the previous node, chosen at random. Think: Duedil -> Founders For Good Ltd -> Omio Limited.

Word2Vec accepts a collection of documents - where every document is a list of tokens (strings). Here company Id’s play the role of tokens and random walks play the role of documents. It all checks out.

To limit the size of the graph for this proof of concept, I have applied this procedure only to the 2.2 million companies that

1. are active
2. have at least one edge (director in common) to another active company

I generated 10 random walks starting at every company, the length of each walk was 40. Training Word2Vec with gensim on this corpus of $10 \times 40 \times 2200000 = 8.8 \times 10^8$ tokens took over 11h. It also took a machine with 40gb of RAM before it stopped crashing even though the random walks were generated on-line.

Finally I got some vectors out of it - one per company. These vectors themselves were the goal of this project (they can serve as features in ML), but I also made some plots to verify that the algorithm is working as advertised.

### Pretty pictures with t-SNE and Bokeh

The embedding produced by DeepWalk was 100-dimensional in this case, so I had to do some dimensionality reduction before trying to visualize the vectors. t-SNE is perfect for this kind of thing. Here’s a sample of 40000 company vectors embedded in 2D with t-SNE. You can move or zoom in the plot or hover over the dots to see the names of the corresponding companies.

Bokeh Plot

It worked! You can immediately see that there is some structure in the data and t-SNE has picked up on it (and this is only a tiny sample - 2% of all the datapoints). What does this structure mean? After the graph has beed transformed with DeepWalk and then t-SNE, the position of a company in this plot doesn’t have a simple interpretation but it’s clear that groups of highly interconnected companies will correspond to clusters of points in this plot. And it’s easy to verify just by looking at the names of the companies that this is the case.

Take the big blob in the upper left corner - the companies there:

• Edwards Macliammoir Dublin Gate Theatre Productions Limited
• Humanist Association of Ireland Limited
• Kildare Street Management Limited
• Shannon Airport Authority Limited

We have discovered the cluster of irish companies! And if you zoom in on the long, narrow appendage sticking out of this cluster towards bottom left - you’ll see companies like:

• Tempelhof Aircraft Leasing (Ireland) Limited
• Gallic Dragon Aviation Limited
• Aergen Aircraft Ten Limited

… and hundreds more. This is not even cherry-picked. I hereby declare the discoverery of the Irish Aviation Peninsula.

Slightly up and to the right of center there is a smaller scottish cluster recognizable through such companies as

• Caledonian Sausage Company
• Edinburgh Tattoo Productions Limited
• Dundee Ice Arena

There are many other smaller clusters and it’s actually a fun exercise to try to pinpoint exactly what do the companies in a cluster have in common.

#### Now in color!

This was fun if somewhat grim looking. Let’s try to add some color to the plot. The original goal of this project was to get graph-derived features for industry classification. Let’s try using different colors to denote different industries (based on SIC codes). If DeepWalk coordinates are predictive of the industry a company is in, we should expect to see same-colored dots (companies in the same industry) clustering together in the plot. Does this actually happen?

Bokeh Plot

A little bit, yes.

Mostly everything is a big reddish mess (“services” is the most popular category). But there are indeed some clusters. Right of center we can see a medium sized pink blob of insurance companies:

• US Risk (UK) Newco Limited
• Zenith Insurance Management UK Limited
• North Star Underwriting Limited

Below it and to the left lies another, this one green:

• Timeless Films Limited
• Hercules Productions Ltd
• Koninck studios PTE Limited

Clearly this is a cluster of film companies (plus other media). If you look more closely you will discover that this is actually the cluster of London based film companies. Nearby there is a smaller green cluster of media companies from the rest of England and another one for Wales. These are less clearly delimited and partly obscured by the red dots of “Services” companies. There are many others, but they are sometimes so tight, they appear as a single dot in the plot.

This is more noisy than I hoped for but it’s definitely working. Would definitely improve accuracy of industry classification if used with other stronger features. Plus you can learn interesting things from it just by looking at the plot. Like the fact that film production companies are closely connected to each other and relatively unconnected to the rest of the world. Or that London is a different country as far as media companies are concerned.

#### Bonus: keyword based company embedding

Having all this t-SNE and Bokeh niceness in place I couldn’t resist applying it to another interesting dataset - keywords. Keywords are a set of industry related tags that DueDil has for millions of companies. They are things like “fishing” or “management consulting” or “b2b”. A company usually has between a few and a few dozen of them.

A byproduct of the pipeline that extracts keywords for companies is a Word2Vec embedding of the keywords. I used this embedding to create an embedding of companies. This was done simply by averaging all the vectors corresponding to a company’s keywords. I ran the resulting vectors through t-SNE and here’s what it looks like:

Bokeh Plot

I shouldn’t be surprised that keywords - which were picked to be industry related - predict the industry really well. But I was blown away by the level of detail preserved by t-SNE. There are recognizable islands for everything. There is a golden Farmers Archipelago and a narrow blue Dentist Island south from Home Care Island. There is a separate Asian Island in the Restaurant Archipelago - go see for yourself.

This was fun. Long live Word2Vec and t-SNE!

# Data Engineers Will Hate You - One Weird Trick to Fix Your Pyspark Schemas

I will share with you a snippet that took out a lot of misery from my dealing with pyspark dataframes. This is pysparks-specific. Nothing to see here if you’re not a pyspark user. The first two sections consist of me complaining about schemas and the remaining two offer what I think is a neat way of creating a schema from a dict (or a dataframe from an rdd of dicts).

#### The Good, the Bad and the Ugly of dataframes

Dataframes in pyspark are simultaneously pretty great and kind of completely broken.

• they enforce a schema
• you can run SQL queries against them
• faster than rdd
• much smaller than rdd when stored in parquet format

On the other hand:

• dataframe join sometimes gives wrong results
• pyspark dataframe outer join acts as an inner join
• when cached with df.cache() dataframes sometimes start throwing key not found and Spark driver dies. Other times the task succeeds but the the underlying rdd becomes corrupted (field values switched up).
• not really dataframe’s fault but related - parquet is not human readable which sucks - can’t easily inspect your saved dataframes

But the biggest problem is actually transforming the data. It works perfectly on those contrived examples from the tutorials. But I’m not working with flat SQL-table-like datasets. Or if I am, they are already in some SQL database. When I’m using Spark, I’m using it to work with messy multilayered json-like objects. If I had to create a UDF and type out a ginormous schema for every transformation I want to perform on the dataset, I’d be doing nothing else all day, I’m not even joking. UDFs in pyspark are clunky at the best of times but in my typical usecase they are unusable. Take this, relatively tiny record for instance:

the correct schema for this is created like this:

And this is what I would have to type every time I need a udf to return such record - which can be many times in a single spark job.

#### Dataframe from an rdd - how it is

For these reasons (+ legacy json job outputs from hadoop days) I find myself switching back and forth between dataframes and rdds. Read some JSON dataset into an rdd, transform it, join with another, transform some more, convert into a dataframe and save as parquet. Or read some parquet files into a dataframe, convert to rdd, do stuff to it, convert back to dataframe and save as parquet again. This workflow is not so bad - I get the best of both worlds by using rdds and dataframes only for the things they’re good at. How do you go from a dataframe to an rdd of dictionaries? This part is easy:

It’s the other direction that is problematic. You would think that rdd’s method toDF() would do the job but no, it’s broken.

actually returns a dataframe with the following schema (df.printSchema()):

It interpreted the inner dictionary as a map of boolean instead of a struct and silently dropped all the fields in it that are not booleans. But this method is deprecated now anyway. The preferred, official way of creating a dataframe is with an rdd of Row objects. So let’s do that.

prints the same schema as the previous method.

In addition to this, both these methods will fail completely when some field’s type cannot be determined because all the values happen to be null in some run of the job.

Also, quite bizarrely in my opinion, order of columns in a dataframe is significant while the order of keys is not. So if you have a pre-existing schema and you try contort an rdd of dicts into that schema, you’re gonna have a bad time.

#### How it should be

Without further ado, this is how I now create my dataframes:

This doesn’t randomly break, doesn’t drop fields and has the right schema. And I didn’t have to type any of this StructType([StructField(... nonsense, just plain python literal that I got by running

As an added bonus now this prototype is prominently displayed at the top of my job file and I can tell what the output of the job looks like without having to decode parquet files. Self documenting code FTW!

#### How to get there

And here’s how it’s done. First we need to implement our own schema inference - the way it should work:

Using this we can now specify the schema using a regular python object - no more java-esque abominations. But this is not all. We will also need a function that transforms a python dict into a rRw object with the correct schema. You would think that this should be automatic as long as the dict has all the right fields, but no - order of fields in a Row is significant, so we have to do it ourselves.

And finally:

# Text Classification With Word2Vec

In the previous post I talked about usefulness of topic models for non-NLP tasks, it’s back to NLP-land this time. I decided to investigate if word embeddings can help in a classic NLP problem - text categorization. Full code used to generate numbers and plots in this post can be found here: python 2 version and python 3 version by Marcelo Beckmann (thank you!).

#### Motivation

The basic idea is that semantic vectors (such as the ones provided by Word2Vec) should preserve most of the relevant information about a text while having relatively low dimensionality which allows better machine learning treatment than straight one-hot encoding of words. Another advantage of topic models is that they are unsupervised so they can help when labaled data is scarce. Say you only have one thousand manually classified blog posts but a million unlabeled ones. A high quality topic model can be trained on the full set of one million. If you can use topic modeling-derived features in your classification, you will be benefitting from your entire collection of texts, not just the labeled ones.

#### Getting the embedding

Ok, word embeddings are awesome, how do we use them? Before we do anything we need to get the vectors. We can download one of the great pre-trained models from GloVe:

and use load them up in python:

or we can train a Word2Vec model from scratch with gensim:

#### The (python) meat

We got ourselves a dictionary mapping word -> 100-dimensional vector. Now we can use it to build features. The simplest way to do that is by averaging word vectors for all words in a text. We will build a sklearn-compatible transformer that is initialised with a word -> vector dictionary.

Let’s throw in a version that uses tf-idf weighting scheme for good measure

These vectorizers can now be used almost the same way as CountVectorizer or TfidfVectorizer from sklearn.feature_extraction.text. Almost - because sklearn vectorizers can also do their own tokenization - a feature which we won’t be using anyway because the benchmarks we will be using come already tokenized. In a real application I wouldn’t trust sklearn with tokenization anyway - rather let spaCy do it.

Now we are ready to define the actual models that will take tokenised text, vectorize and learn to classify the vectors with something fancy like Extra Trees. sklearn’s Pipeline is perfect for this:

#### Benchmarks

I benchmarked the models on everyone’s favorite Reuters-21578 datasets. Extra Trees-based word-embedding-utilising models competed against text classification classics - Naive Bayes and SVM. Full list of contestants:

• mult_nb - Multinomial Naive Bayes
• bern_nb - Bernoulli Naive Bayes
• svc - linear kernel SVM
• glove_small - ExtraTrees with 200 trees and vectorizer based on 50-dimensional gloVe embedding trained on 6B tokens
• glove_big - same as above but using 300-dimensional gloVe embedding trained on 840B tokens
• w2v - same but with using 100-dimensional word2vec embedding trained on the benchmark data itself (using both training and test examples [but not labels!])

Each of these came in two varieties - regular and tf-idf weighted.

The results (on 5-fold cv on a the R8 dataset of 7674 texts labeled with 8 categories):

SVM wins, word2vec-based Extra Trees is a close second, Naive Bayes not far behind. Interestingly, embedding trained on this relatively tiny dataset does significantly better than pretrained GloVe - which is otherwise fantastic. Can we do better? Let’s check how do the models compare depending on the number of labeled training examples. Due to its semi-supervised nature w2v should shine when there is little labeled data.

That indeed seems to be the case. w2v_tfidf’s performance degrades most gracefully of the bunch. SVM takes the biggest hit when examples are few. Lets try the other two benchmarks from Reuters-21578. 52-way classification:

Qualitatively similar results.

And 20-way classification:

This time pretrained embeddings do better than Word2Vec and Naive Bayes does really well, otherwise same as before.

#### Conclusions

1. SVM’s are pretty great at text classification tasks
2. Models based on simple averaging of word-vectors can be surprisingly good too (given how much information is lost in taking the average)
3. but they only seem to have a clear advantage when there is ridiculously little labeled training data

At this point I have to note that averaging vectors is only the easiest way of leveraging word embeddings in classification but not the only one. You could also try embedding whole documents directly with Doc2Vec. Or use Multinomial Gaussian Naive Bayes on word vectors. I have tried the latter approach but it was too slow to include in the benchmark.

Update 2017: actually, the best way to utilise the pretrained embeddings would probably be this https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html I shall add this approach to the benchmark when I have some time.

1. Sometimes pretrained embeddings give clearly superior results to word2vec trained on the specific benchmark, sometimes it’s the opposite. Not sure what is going on here.

Overall, we won’t be throwing away our SVMs any time soon in favor of word2vec but it has it’s place in text classification. Like when you have a tiny training set or to ensemble it with other models to gain edge in Kaggle.

Plus, can SVM do this:

prints