
DS Toolbox - Topic Models

If you’re not primarily working with NLP, you may not have been paying attention to topic modeling and word embeddings. In this post I intend to convince you that you should.

Topic models

Topic models are a family of NLP models that discover common themes in a collection of documents. You give one a list of texts and it comes up with a set of topics and maps every document to a topic or a mixture of topics. Here you can see a visualization of a topic model applied to wikipedia articles. As you can see, it picks up the same kinds of themes in the texts that a human being would notice. Take topic 5 for instance. Its top relevant words are “war”, “force”, “army”, “attack”, “military” and its top articles: “Second Boer War”, “Erwin Rommel”, “Axis powers”, “Vietnam War”. Pretty neat. Most topic models (including the most popular one, LDA) produce a vector of numbers for every text - the distribution of topics - and a similar vector for every word - the affinity of the word to each topic.
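To make this concrete, here is a minimal sketch with gensim’s LdaModel (the three-document corpus is made up for illustration):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# toy corpus: each document is just a list of tokens
docs = [["war", "army", "attack", "military"],
        ["economy", "market", "trade", "bank"],
        ["war", "military", "force", "army"]]

dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(bow, num_topics=2, id2word=dictionary)

print(lda.show_topics())  # the word-topic affinities
print(lda[bow[0]])        # the topic mixture of the first document
```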

Word (and document) embeddings

Word embeddings are related to topic models, but instead of mapping a text to a mixture of topics they learn to map words to real-valued fixed-size vectors. The mapping is designed to preserve semantic structure, so that distance between vectors corresponds to distance in meaning between words. An additional (and probably unintended) property of these embeddings is that you can do a sort of algebra on words, like this:

$$king - man + woman \approx queen$$

- an example taken from the most famous word embedding algorithm, word2vec. How cool is that? If you want to play with the vectors yourself and discover new fun analogies ($Bush - Iraq + Vietnam \approx Nixon$ !!!) you can download pretrained vectors from word2vec or GloVe.
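With gensim, the analogy arithmetic is a one-liner. A sketch, assuming you have already downloaded a set of pretrained vectors (the file name below is just wherever you saved them):

```python
from gensim.models import KeyedVectors

# point this at the pretrained vectors you downloaded
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ~ queen
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=1))
```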

Why you should care

Unsupervised algorithms learning analogies between real-world entities is pretty cool in itself, but it obscures the wider applicability of these algorithms. Topic modeling and embedding algorithms were invented in the context of NLP, but most of them (including LDA and word2vec) have no inherent knowledge of natural language and can be applied just as well to sequences of items from a discrete set of any kind. This is huge. Sequences (or sets) of items are notoriously hard to work with. Most popular ML algorithms expect fixed-size vectors as inputs. If your inputs are sequences, you have a few options:

  • do one-hot encoding and then use any old ML algo you want (like a random forest). Unfortunately one-hot discards all information about the order of items. More importantly, as the size of the vocabulary grows beyond thousands (which is still tiny as far as vocabularies go) your random forest will take forever to train and overfit.
  • use Naive Bayes. NB works surprisingly well, but only for classification and only when the problem is - well, naive. It also utilizes item-order information only to a very limited extent, via n-grams.
  • use a recurrent neural network. This can actually be very effective in some cases, but I don’t think it would work well with a vocabulary size in the thousands. Even character-level language models take days to train properly on a GPU (but what incredible results they produce!). I believe NNs of all kinds will get better and easier to use, but as of now this is not a practical solution to our problem at all.

This is where word embeddings come in. Just run word2vec on your sequences of items and you’ll get a reasonably low-dimensional representation of every item. You can then take the mean of all the vectors in a sequence to get a representation of the sequence. Or run doc2vec and you’ll get a vector for each sequence directly. Or if you need them clustered - run LDA. LDA’s word-topic coefficients can also be used as word embeddings, but they don’t do nearly as good a job of it as word2vec. The same goes for LDA’s text-topic coefficients as a document-to-vector mapping.
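Nothing in word2vec’s API cares whether the “words” are actual words. A sketch with made-up item IDs:

```python
import numpy as np
from gensim.models import Word2Vec

# "sentences" are just sequences of items from a discrete vocabulary
sequences = [["item_1", "item_7", "item_3"],
             ["item_2", "item_7", "item_9", "item_1"],
             ["item_3", "item_4", "item_7"]]

model = Word2Vec(sequences, vector_size=32, min_count=1)

item_vector = model.wv["item_7"]  # embedding of a single item

# a whole sequence as the mean of its item vectors
seq_vector = np.mean([model.wv[i] for i in sequences[0]], axis=0)
```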

This is it. This is what topic modeling buys you. It’s a generic clustering and feature extraction technique that works on sequences (or sets) of items from a discrete vocabulary. Does it actually work in practice? I don’t know of a lot of examples of people using it but I know a few.

Page jumps

As I have already described in my mini-guide to data science interviews (question “Predicting page jumps”), topic modeling can be used to model users jumping between the pages of a web application. Here a page plays the role of a word and a user’s journey is a sentence. You can (and, from what I understood during the interview, the company I interviewed with does) use topic modeling to segment users based on their journeys and to extract features for predicting page jumps.

Ad clicks

In a similar vein, topic modeling has been used as a feature extraction technique in the prediction of ad clicks. The paper concentrates on the authors’ special brand of topic modeling, but the idea is simple and can be used with any topic model, for example LDA. They treat hosts (the websites with banners on them) as words and the users who visit those websites as documents. Let’s say John visits youtube, then google, then wikipedia, then youtube again, then tmz, then guardian. This gives us the sequence [“youtube”, “google”, “wikipedia”, “youtube”, “tmz”, “guardian”]. Topic modeling is applied to the set of all users and the result is a set of topics (“media”, “business” and “drive” in the paper) and a decomposition of every user into those topics. So for our John we should expect something like {media: 0.8, business: 0.2, drive: 0}. This is interesting in itself and constitutes a great feature vector that you can feed into a regression predicting clicks - which is exactly what the authors did.
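A sketch of that pipeline, with plain LDA standing in for the authors’ custom topic model (browsing histories and click labels are invented):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.matutils import corpus2dense
from sklearn.linear_model import LogisticRegression

histories = [["youtube", "google", "wikipedia", "youtube", "tmz", "guardian"],
             ["bloomberg", "ft", "google", "bloomberg"],
             ["youtube", "tmz", "guardian", "youtube"]]
clicked = [1, 0, 1]  # made-up click labels

dictionary = Dictionary(histories)
bow = [dictionary.doc2bow(h) for h in histories]
lda = LdaModel(bow, num_topics=3, id2word=dictionary)

# user-topic mixtures as a dense feature matrix, one row per user
features = corpus2dense(lda[bow], num_terms=3).T

# feed the topic mixtures into a click-prediction regression
clf = LogisticRegression().fit(features, clicked)
```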

DeepWalk

Feature extraction from graphs is even harder than from sequences. Say you want to classify youtube videos. You have a labeled training set and a set of features for every video. But you also know that some videos link to other videos - forming a graph. Information about a video’s position in this graph is bound to be useful for classification (think “oh, I’m in that weird part of youtube again”) - but how do you use it? The authors of the DeepWalk paper compared several approaches, and the best turned out to be a trick of their own invention involving word2vec. This time the role of a word is played by a node in the graph (a youtube video). What plays the role of a sentence? For that, a collection of short random walks on the graph is constructed, starting at every node. Think about what happens when a youtube video ends and another one starts playing, and then another one, and another. Do this a few times for every video on youtube and you have a corpus of texts. The authors applied word2vec to those sequences to get vector embeddings for videos and then used these vectors as features in classification. It worked better than all the other approaches - even ones that use global features of the graph. Awesome.
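A rough sketch of the same trick on a toy graph - this is my paraphrase of the idea, not the paper’s code (the karate-club graph stands in for the video graph):

```python
import random
import networkx as nx
from gensim.models import Word2Vec

graph = nx.karate_club_graph()  # stand-in for the video graph

def random_walk(graph, start, length=10):
    """One short random walk - the analogue of a sentence."""
    walk = [start]
    while len(walk) < length:
        neighbours = list(graph.neighbors(walk[-1]))
        if not neighbours:
            break
        walk.append(random.choice(neighbours))
    return [str(node) for node in walk]

# several walks starting from every node - the "corpus"
walks = [random_walk(graph, node)
         for node in graph.nodes() for _ in range(5)]

# word2vec on the walks yields one vector per node
model = Word2Vec(walks, vector_size=32, min_count=1)
node_vector = model.wv["0"]
```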

Update July 2016
I have tried this very algorithm on the graph of UK and Irish companies - here are the results.

Frequent itemsets / collaborative filtering

Association rule learning is a popular task in the context of retail. If a customer bought butter and bread, what other item are they likely to buy? The usual approach is to count the instances of people buying {bread, butter, X} and divide that number by the count of people buying {bread, butter} - this estimates the probability of buying X. Then you can find the X that maximizes this probability and do something with it (suggest it to the user as they are about to check out, perhaps). This is a bit crude, not very robust, and it doesn’t provide any insight - just the prediction. What you can do instead is run (you guessed it) topic modeling, with items playing the role of words and baskets playing the role of sentences. Word2vec will give you vector representations of both items and baskets, which will let you use more sophisticated algorithms for predicting the next item. You also get a segmentation of all users and all items for free. To understand why this is superior, consider this: topic modeling will easily pick up on the cluster of vegetarian buyers, and then the model will know not to recommend pork chops to a buyer even if they bought three other items that usually go with bacon - something frequent-itemset algorithms are incapable of. When a new type of soy-based pork substitute appears on the shelves, the algorithm will also take much less time to figure out that it belongs to the vegetarian cluster and is analogous to meat. I don’t actually know whether anyone in retail is doing topic modeling on baskets, but if they aren’t, they should. I’ll do it myself if I can find a free dataset of retail baskets.
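A sketch of the basket version (product names invented; note the large window, since a basket is a set and every item should see every other item as context):

```python
from gensim.models import Word2Vec

baskets = [["bread", "butter", "jam"],
           ["bread", "butter", "bacon", "eggs"],
           ["tofu", "lettuce", "soy_milk"],
           ["tofu", "lettuce", "bread"]]

# window larger than any basket, so order inside a basket doesn't matter
model = Word2Vec(baskets, vector_size=16, window=20, min_count=1)

# items that tend to live in the same baskets as bread and butter
print(model.wv.most_similar(positive=["bread", "butter"], topn=3))
```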

Update July 2016
Called it. People are totally doing that now here and here. I don’t know why in the above text I fixated on frequent itemsets - which is just a specific, outdated way of doing collaborative filtering.

If you know of any other cool applications of topic modeling to non-NLP problems let me know.

If you want to play with topic models yourself, I wholeheartedly recommend gensim. I also tried MLlib, but its word2vec implementation required 3 times as much RAM (for each of the 10 cores I used) and was still about ten times slower than gensim.

Lead Scoring Without Negative Examples

How do you train a binary classifier when you have only positive-labeled training examples? Impossible? Maybe. But sometimes you can do something just as good. Let’s start from the beginning…

Lead generation

Everyone and their mum in the b2b sector is trying to use machine learning for lead generation. For our purposes, lead-gen means selecting, from the millions of active companies in your country, the ones that are most likely to buy your product once contacted by your sales team. Needless to say, lead generation is extremely serious business and constitutes a market sector in its own right - a type of b2b2b if you will.

One popular lead-gen strategy is to look for leads among companies similar to the ones that have worked out for you in the past. A machine learning implementation of this strategy is straightforward:

  1. collect data on all relevant companies
  2. take all the companies your sales team has contacted in the past and label each one as either a successful (you made money) or a failed (you wasted your time) lead. This is your training set.
  3. extract good features for every company*
  4. train a binary classifier (predicting successful/failed label) on your training set. Ideally this should be a classifier that outputs probabilities.
  5. apply the classifier to all the companies outside the training set and hand the results over to sales. The number the classifier assigns to a given company is (in theory at least) the probability that the lead will be successful. The highest scorers should be prioritized; the low scorers should be left alone. So far, so good.
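A minimal sketch of steps 4 and 5, assuming the features from step 3 are already sitting in a matrix (all names and data below are placeholders):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))      # contacted companies (step 3 features)
y_train = rng.integers(0, 2, size=200)   # 1 = successful lead, 0 = failed
X_rest = rng.normal(size=(5000, 5))      # every company never contacted

# step 4: a classifier that outputs probabilities
clf = GradientBoostingClassifier().fit(X_train, y_train)

# step 5: score everyone else and hand the top scorers to sales
scores = clf.predict_proba(X_rest)[:, 1]
top_leads = np.argsort(scores)[::-1][:50]
```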

Forgetting salesman problem

An interesting difficulty frequently arises when you try to apply this approach in real life. Sales teams usually keep good records on their clients, but often fail to keep track of all the prospects that didn’t work out. As a result, you end up with only positive examples to train your classifier on. In general it’s not possible to train a classifier on a training set with one class amputated this way. But…

The solution

TL;DR: divide the density of “good” points by the density of all points - this ratio is the lead score.

Keep in mind that:

  • in addition to the positive examples, you have also the full set of all data points (all companies whether previously contacted or not)
  • you are not interested in the absolute probability of success of a given lead per se. You only really need the relative ranking of different leads. All the classifier scores may be off by a factor of 10 but as long as the order is preserved - you’ll still be able to prioritize good leads over rubbish ones.

A solution will present itself once the problem is stated in mathematical terms. Consider the space of all possible company features $X$. Given are:

  • a set $C$ of points in $X$ - all companies
  • a subset $T_+$ of the above set - companies that were tested and gave a positive result

Unknown are:

  • the probability $f(x)$ that a company $x$ will test positive
  • the subset $T$ of all tested companies

This is how it looks in an artificial 1-D dataset.

Notice that the distribution of companies $C$ and the conversion probability $f$ peak at different points, and the distribution of positive examples $T_+$ peaks somewhere between those two.

It’s useful to think of $C$ as a sample from some probability distribution $P(x)$ defined on $X$. Or, to express the same thing in the language of point densities in space:

$$\rho_C(x) \approx N_C \, P(x)$$

where $N_C$ is the total number of companies.

If the companies tested $T$ were chosen at random from all companies $C$**, then they constitute a sample from the same probability distribution $P(x)$, with its own density:

$$\rho_T(x) \approx N_T \, P(x)$$

Since a company’s chance of testing positive is $f$, the density of companies that tested positive is simply:

$$\rho_{T_+}(x) \approx f(x) \, \rho_T(x) = N_T \, f(x) \, P(x)$$

The set $T$ is hidden from us, but $C$ and $T_+$ are available, so we can use them to find $f(x)$. Combining the first and third equations we get:

$$f(x) \approx \frac{\rho_{T_+}(x)}{N_T \, P(x)} = \frac{N_C}{N_T} \cdot \frac{\rho_{T_+}(x)}{\rho_C(x)} \propto \frac{\rho_{T_+}(x)}{\rho_C(x)}$$

This is it. If we can just estimate the densities of $C$ and $T_+$ - we’re golden. Their ratio gives the $f$ function up to a constant, which is all we need to do lead scoring. There are many ways to do density estimation in practice:

  • histogram-based estimation - lowest on the sophistication scale - divide the space $X$ into sectors, count the number of points in every sector, and divide by the sector’s volume. Do it for both $C$ and $T_+$ and divide one result by the other
  • nearest-neighbour estimation - for any point of interest find the $k$ (let’s say $k=20$) nearest points in $C$. Count how many of them are also in $T_+$ and divide the two figures
  • kernel-based estimation - in a way this is similar to the knn version, except instead of taking the $k$ nearest points you take all of them, but discount the farthest by a decreasing function of distance
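A sketch of the nearest-neighbour version on synthetic data (the feature matrix and the set of positives are made up):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
companies = rng.normal(size=(5000, 4))                    # the set C
positive_idx = rng.choice(5000, size=200, replace=False)  # the set T+

is_positive = np.zeros(len(companies), dtype=bool)
is_positive[positive_idx] = True

# for every company, find its k nearest neighbours among all companies
k = 20
nn = NearestNeighbors(n_neighbors=k).fit(companies)
_, neighbours = nn.kneighbors(companies)

# fraction of neighbours in T+ estimates rho_T+/rho_C, i.e. f up to a constant
scores = is_positive[neighbours].mean(axis=1)
```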

This is how it looks on our toy dataset:

Unsurprisingly, on a toy model everything works rather well. In reality, conversion rates are low, data is sparse, dimensionality is high, and even the basic assumption - “a good lead is one similar to what has worked in the past” - is questionable. But you have to start somewhere.

For more information about the ML aspect of classification without negatives google the term “PU learning”.

* this is arguably the hardest and most important step, but I’m glossing over it as it’s not relevant to this post

** if there is prior knowledge regarding how the contacted companies were chosen, it can be easily incorporated

Test Post

This is my first post. pls work pls

This is a code block (hopefully):

```python
def fibo(n):
    if n < 2:
        return 1
    else:
        return fibo(n - 1) + fibo(n - 2)
```