Looking for the Text Top Model

TL;DR: I tested a bunch of neural network architectures plus SVM + NB on several text classification datasets. Results at the bottom of the post.

Last year I wrote a post about using word embeddings like word2vec or GloVe for text classification. The embeddings in my benchmarks were used in a very crude way - by averaging word vectors for all words in a document and then plugging the result into a Random Forest. Unfortunately, the resulting classifier turned out to be strictly inferior to a good old SVM except in some special circumstances (very few training examples but lots of unlabeled data).

There are of course better ways of utilising word embeddings than averaging the vectors, and last month I finally got around to trying them. As far as I can tell from a brief survey of arxiv, most state of the art text classifiers use embeddings as inputs to a neural network. But what kind of neural network works best? LSTM? CNN? BLSTM with CNN? There are dozens of tutorials on the internet showing how to implement this or that neural classifier and testing it on some dataset. The problem with them is that they usually give metrics without context. Someone says they achieved 0.85 accuracy on some dataset. Is that good? Should I be impressed? Is it better than Naive Bayes or SVM? Than other neural architectures? Was it a fluke? Does it work as well on other datasets?

To answer those questions, I implemented several network architectures in Keras and created a benchmark where those algorithms compete with classics like SVM and Naive Bayes. Here it is.

I intend to keep adding new algorithms and datasets to the benchmark as I learn about them. I will update this post when that happens.

Models

All the models in the repository are wrapped in scikit-learn compatible classes with .fit(X, y), .predict(X), .get_params() and with all the layer sizes, dropout rates, n-gram ranges etc. parametrised. The snippets below are simplified for clarity.

Since this was supposed to be a benchmark of classifiers, not of preprocessing methods, all datasets come already tokenised and the classifier is given a list of token ids, not a string.
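For illustration - the input to every classifier looks something like this (the ids are made up):

X = [[12, 7, 91, 3, 7],   # first document, as a list of token ids
     [5, 44, 2]]          # second document
y = [0, 1]                # class labels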

Naive Bayes

Naive Bayes comes in two varieties - Bernoulli and Multinomial. We can also use tf-idf weighting or simple counts, and we can include n-grams. Since sklearn’s vectorizer expects a string and we will be giving it a list of integer token ids, we have to override the default preprocessor and tokenizer.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline

# the documents are lists of token ids - turn each id into a string
# and skip tokenisation altogether
vectorizer = TfidfVectorizer(
    preprocessor=lambda x: [str(t) for t in x],
    tokenizer=lambda x: x,
    ngram_range=(1, 3))

model = Pipeline([('vectorizer', vectorizer), ('model', MultinomialNB())])
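Wrapped in a Pipeline like this, the model exposes the usual sklearn interface and can be fed the token-id lists directly:

model.fit(X_train, y_train)
predicted = model.predict(X_test)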

SVM

SVMs are a strong baseline for any text classification task. We can reuse the same vectorizer for this one.

from sklearn.svm import SVC

model = Pipeline([('vectorizer', vectorizer), ('model', SVC())])

Multi Layer Perceptron

In other words - a vanilla feed forward neural network. This model doesn’t use word embeddings; the input to the model is a bag of words.

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

vocab_size = 20000
num_classes = 3

model = Sequential()
model.add(Dense(128, input_shape=(vocab_size,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Inputs to this model need to be encoded as fixed-size binary bag-of-words vectors, and the labels need to be one-hot encoded.

import keras
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=vocab_size)
X = tokenizer.sequences_to_matrix(X, mode='binary')
y = keras.utils.to_categorical(y, num_classes)

(Bidirectional) LSTM

This is where things start to get interesting. The input to this model is not a bag of words but a sequence of word ids. The first thing to do is construct an embedding layer that will translate this sequence into a matrix of d-dimensional vectors.

import numpy as np
from keras.layers import Embedding

max_seq_len = 100
embedding_dim = 37
# we will initialise the embedding layer with random values and set trainable=True
# we could also initialise with GloVe and set trainable=False
embedding_matrix = np.random.normal(size=(vocab_size, embedding_dim))
embedding_layer = Embedding(
    vocab_size,
    embedding_dim,
    weights=[embedding_matrix],
    input_length=max_seq_len,
    trainable=True)

Now for the model proper:

from keras.layers import Dense, Input, LSTM, Bidirectional
from keras.models import Model

units = 64
class_count = 3

sequence_input = Input(shape=(max_seq_len,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

layer1 = LSTM(units,
    dropout=0.2,
    recurrent_dropout=0.2,
    return_sequences=True)
# for a bidirectional LSTM do:
# layer1 = Bidirectional(layer1)
x = layer1(embedded_sequences)

layer2 = LSTM(units,
    dropout=0.2,
    recurrent_dropout=0.2,
    return_sequences=False)  # the last LSTM layer must have return_sequences=False
x = layer2(x)

final_layer = Dense(class_count, activation='softmax')
predictions = final_layer(x)
model = Model(sequence_input, predictions)

This model - and all the other models using embeddings - requires that the labels are one-hot encoded and that the word id sequences are padded to a fixed length with zeros:

from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

X = pad_sequences(X, max_seq_len)
y = to_categorical(y, num_classes=class_count)

François Chollet’s CNN

This is the (slightly modified) architecture from the Keras tutorial. It’s specifically designed for texts of length 1000, so I only used it for document classification, not for sentence classification.

from keras.layers import Conv1D, MaxPooling1D, Flatten

units = 35
dropout_rate = 0.2

x = Conv1D(units, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Dropout(dropout_rate)(x)
x = Conv1D(units, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Dropout(dropout_rate)(x)
x = Conv1D(units, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)
x = Dropout(dropout_rate)(x)
x = Flatten()(x)
x = Dense(units, activation='relu')(x)
predictions = Dense(class_count, activation='softmax')(x)
model = Model(sequence_input, predictions)

Yoon Kim’s CNN

This is the architecture from Yoon Kim’s paper; my implementation is based on Alexander Rakhlin’s. This one doesn’t rely on the text being exactly 1000 words long and is better suited for sentences.

from keras.layers import Conv1D, MaxPooling1D, Concatenate, Flatten

num_filters = 8
filter_sizes = (3, 8)
units = 64

z = Dropout(0.2)(embedded_sequences)
conv_blocks = []
for sz in filter_sizes:
    conv = Conv1D(
        filters=num_filters,
        kernel_size=sz,
        padding="valid",
        activation="relu",
        strides=1)(z)
    conv = MaxPooling1D(pool_size=2)(conv)
    conv = Flatten()(conv)
    conv_blocks.append(conv)
z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]

z = Dropout(0.2)(z)
z = Dense(units, activation="relu")(z)
predictions = Dense(class_count, activation="softmax")(z)
model = Model(sequence_input, predictions)

BLSTM2DCNN

The authors of the paper claim that combining a BLSTM with a CNN gives even better results than using either of them alone. Weirdly, unlike the previous two models, this one uses 2D convolutions. This means that the receptive fields of neurons run not just across neighbouring words in the text but also across neighbouring coordinates in the embedding vectors. This is suspicious, because there is no relation between consecutive coordinates in e.g. the GloVe embeddings which they use. If one neuron learns a pattern involving coordinates 5 and 6, there is no reason to think that the same pattern will generalise to coordinates 22 and 23 - which makes the convolution pointless. But what do I know.

from keras.layers import Conv2D, MaxPool2D, Reshape, Flatten

units = 128
conv_filters = 32

x = Dropout(0.2)(embedded_sequences)
x = Bidirectional(LSTM(
    units,
    dropout=0.2,
    recurrent_dropout=0.2,
    return_sequences=True))(x)
# the BLSTM output has shape (max_seq_len, 2 * units); add a channels axis for Conv2D
x = Reshape((max_seq_len, 2 * units, 1))(x)
x = Conv2D(conv_filters, (3, 3))(x)
x = MaxPool2D(pool_size=(2, 2))(x)
x = Flatten()(x)
predictions = Dense(class_count, activation='softmax')(x)
model = Model(sequence_input, predictions)

Stacking

In addition to all those base models, I implemented a stacking classifier to combine the predictions of all those very different models. I used two versions of stacking: one where the base models return probabilities, which are combined by a simple logistic regression, and one where the base models return labels, and XGBoost is used to combine those.
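A minimal sketch of the first variant (the names here are mine, not the benchmark’s): out-of-fold probability predictions of the base models become the features of a logistic regression.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def stack_with_logreg(base_models, X, y):
    # out-of-fold class probabilities of every base model, concatenated
    meta_features = np.hstack([
        cross_val_predict(m, X, y, cv=5, method='predict_proba')
        for m in base_models])
    return LogisticRegression().fit(meta_features, y)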

Datasets

For the document classification benchmark I used all the datasets from here. This includes the 20 Newsgroups, Reuters-21578 and WebKB datasets in all their different versions (stemmed, lemmatised, etc.).

For the sentence classification benchmark I used the movie review polarity dataset and the Stanford sentiment treebank dataset.

Results

Some models were only included in the document classification benchmark or only in the sentence classification one - because they either performed terribly on the other task or took too long to train. Hyperparameters of the neural models were (somewhat) tuned on one of the datasets before including them in the benchmark. The ratio of training to test examples was 0.7 : 0.3. Every dataset was split this way 10 times and each model was tested on all 10 splits. The tables below show accuracies averaged across the 10 splits.
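Conceptually, the evaluation loop is as simple as this sketch (not the exact benchmark code):

import numpy as np
from sklearn.model_selection import train_test_split

def average_accuracy(model, X, y, n_splits=10):
    # average test accuracy over random 70:30 splits
    scores = []
    for seed in range(n_splits):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=seed)
        model.fit(X_train, y_train)
        scores.append(np.mean(model.predict(X_test) == np.array(y_test)))
    return np.mean(scores)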

Without further ado:

Document classification benchmark

model             r8-all-terms.txt    r52-all-terms.txt    20ng-all-terms.txt    webkb-stemmed.txt
--------------  ------------------  -------------------  --------------------  -------------------
MLP 1x360                    0.966                0.935                 0.924                0.930
SVM tfidf 2-gr               0.966                0.932                 0.920                0.911
SVM tfidf                    0.969                0.941                 0.912                0.906
MLP 2x180                    0.961                0.886                 0.914                0.927
MLP 3x512                    0.966                0.927                 0.875                0.915
CNN glove                    0.964                0.920                 0.840                0.892
SVM 2-gr                     0.953                0.910                 0.816                0.879
SVM                          0.955                0.917                 0.802                0.868
MNB                          0.933                0.848                 0.877                0.841
CNN 37d                      0.931                0.854                 0.764                0.879
MNB bi                       0.919                0.817                 0.850                0.823
MNB tfidf                    0.811                0.687                 0.843                0.779
MNB tfidf 2-gr               0.808                0.685                 0.861                0.763
BNB                          0.774                0.649                 0.705                0.741
BNB tfidf                    0.774                0.649                 0.705                0.741

Full results csv.

Sentence classification benchmark

model               subjectivity_10k.txt    polarity.txt
----------------  ----------------------  --------------
Stacker LogReg                     0.935           0.807
Stacker XGB                        0.932           0.793
MNB 2-gr                           0.921           0.782
MNB tfidf 2-gr                     0.917           0.785
MNB tfidf 3-gr                     0.916           0.781
MNB tfidf                          0.919           0.777
MNB                                0.918           0.772
LSTM GloVe                         0.921           0.765
BLSTM Glove                        0.917           0.766
SVM tfidf 2-gr                     0.911           0.772
MLP 1x360                          0.910           0.769
MLP 2x180                          0.907           0.766
MLP 3x512                          0.907           0.761
SVM tfidf                          0.905           0.763
BLSTM2DCNN GloVe                   0.894           0.746
CNN GloVe                          0.901           0.734
SVM                                0.887           0.743
LSTM 12D                           0.891           0.734
CNN 45D                            0.893           0.682
LSTM 24D                           0.869           0.703
BLSTM2dCNN 15D                     0.867           0.656

Full results csv.

Conclusions

Well, this was underwhelming.

None of the fancy neural networks with embeddings managed to beat Naive Bayes and SVM, at least not consistently. A simple feed forward neural network with a single hidden layer did better than any other architecture.

I blame my hyperparameters - I didn’t tune them enough. In particular the number of epochs to train: it was determined once for each model, but different datasets and different splits probably require different settings.

And yet, the neural models are clearly doing something right because adding them to the ensemble and stacking significantly improves accuracy.

When I find out what exactly the secret sauce is that makes the neural models achieve the state of the art accuracies that papers claim, I will update my implementations and this post.

So You Think You Can Stats

TL;DR: I prepared 5 puzzles about statistics that should be accessible to anyone without being trivial. Scroll down for the puzzles.

Data Science and Statistics

“Data science is statistics on a Mac”

“Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”

Then there is the famous Venn diagram with data science at the intersection of statistics, hacking and substantive expertise.

What the hell?

Based on all those memes one would think that data scientists spend equal amounts of time writing code and writing integrals on whiteboards. Thinking about the right data structure and thinking about the right statistical test. Debugging pipelines and debugging equations.

And yet, I can’t remember a single time when I got to solve an integral on the job (and believe me, it’s not for lack of trying). I spent a total of maybe a week in the last 3 years explicitly thinking about statistical tests. Sure, means and medians and variances come up on a daily basis but it would be setting the bar extremely low to call that ‘doing statistics’.

Someone is bound to comment that I’m doing data science wrong or that I’m not a true data scientist. Maybe. But if a true data scientist is someone who does statistics more than 10% of the time, then I’ve yet to meet one.

The other kind of statistics

But maybe this is the wrong way to think about it. Maybe my problem is that I was expecting mathematical statistics where I should have been expecting real world statistics.

Mathematical statistics is a branch of mathematics. Data scientists like to pretend they do it, but they don’t.

Real world statistics is an applied science. It’s what people actually do to make sense of datasets. It requires a good intuitive understanding of the basics of mathematical statistics, a lot of common sense and only infrequently any advanced knowledge of mathematical statistics. Data scientists genuinely do it, a lot of the time.

In my defense, it was an easy mistake to make. Mathematical statistics is what is taught in most courses and textbooks. If any statistics questions come up in a job interview for a data science role, they will be of the mathematical variety.

To illustrate what I mean by ‘real world statistics’, to show that this discipline is not trivial and is interesting in its own right, I prepared a short quiz. There are 5 questions. None of them require any complicated math or any calculations. They do require a good mathematical intuition though.

I encourage you to try to solve all of them yourself before checking the answers. It’s easy to convince yourself that a problem is trivial after you’ve read the solution! If you get stuck though, every question has a hint below.

Questions

Cancer

According to CDC data, US counties with the lowest incidence of kidney cancer happen to all be rural, sparsely populated and located in traditionally Republican states. Can you explain this fact? What does it tell us about the causes of cancer?

Bar Fights

According to a series of interviews conducted with people who have been in a bar fight, 9 out of 10 times, when someone dies in a bar fight, he was the one who started it. How can you explain this remarkable finding?

Competitions

After Google measured on-the-job performance of their programmers, they found a negative correlation between being a former winner of a programming competition and being successful on the job. That’s right - champion coders did worse on average. That raises important questions for recruiters. Do programming competitions make you worse at 9-5 programming? Should employers start screening out champion coders?

Exams

It is well documented that students from underprivileged backgrounds underperform academically at all levels of education. Two students enter a university - one of them comes from an underprivileged group, the other from a privileged one. They both scored exactly the same on the university admission exam. Should you expect the underprivileged student to do better, the same or worse in the next university exam, compared to the other student? Bear in mind that although their numerical scores on the admission test were the same, the underprivileged student heavily outperformed expectations based on his/her background, while the other student did exactly as well as expected from his/her background.

Sex partners

According to studies, the average number of sex partners Britons have had in their lives is 9.3 for men and 4.7 for women. How can those numbers possibly be different? After all, each time a man and a woman become sex partners, they increase the total sex partner tally for both sexes by +1. Find at least 5 different factors that could (at least in theory) account for the difference and rate them for plausibility.

Hints

Cancer

Bar Fights

Competitions

Exams

Sex Partners

Answers

Cancer

It tells us nothing about the causes of cancer; it’s a purely statistical effect and it has to be this way. Sparsely populated counties have fewer people in them, so the sampling error is higher. That’s it. Think about an extreme case - a county with a population of 1. If the only inhabitant of this county gets kidney cancer, the county will have a 100% kidney cancer rate! If this person remains healthy instead, the county will have a cancer incidence rate of 0%. It’s easier for a small group of people to have an extremely high or extremely low rate of anything just by chance. Needless to say, Republicanism has nothing to do with cancer (as far as we know) - it’s just that rural areas are both sparsely populated and tend to lean Republican.
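If you don’t believe that sampling error alone can do this, it takes a few lines of numpy to check (the numbers are made up; every county has the same true incidence):

import numpy as np

np.random.seed(0)
rate = 0.001
small = np.random.binomial(100, rate, size=3000) / 100.0          # 3000 counties of 100 people
large = np.random.binomial(1000000, rate, size=3000) / 1000000.0  # 3000 counties of a million people
print(small.min(), small.max())  # extremes like 0.0 and 0.03 - pure chance
print(large.min(), large.max())  # everything very close to 0.001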

This example comes from Daniel Kahneman’s awesome book Thinking Fast And Slow. This blog post has a really nice visualisation of the actual CDC data that illustrates this effect.

Bar Fights

People lie. Of course the dead one will be blamed for everything!

Competitions

This one is slightly more subtle. It is not inconceivable that being a Programming Competition Winner (PCW) makes one less likely to be a Good Engineer (GE). But this is not the only and, IMO, not the most plausible explanation of the data. It could very well be that in the general population there is no correlation between GE and PCW (or a positive one), and the observed negative correlation is purely due to Google’s hiring practices. Imagine a fictional hiring policy where Google only hires people who either won a competition (PCW) or are proven superstar engineers (GE) - based on their open source record. In that scenario any engineer working at Google who was not a PCW would automatically be a GE - hence a negative correlation between GE and PCW among googlers. The correlation in the subpopulation of googlers may very well be the opposite of the correlation in the entire population. Treating PCW as a negative when hiring programmers would be premature.
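This selection effect is also easy to simulate (a sketch with made-up numbers): below, competition skill and engineering skill are independent in the population, yet negatively correlated among the ‘hired’.

import numpy as np

np.random.seed(0)
pcw_skill = np.random.normal(size=100000)
ge_skill = np.random.normal(size=100000)      # independent of pcw_skill
hired = (pcw_skill > 1.5) | (ge_skill > 1.5)  # hire anyone who excels at either
print(np.corrcoef(pcw_skill, ge_skill)[0, 1])                # ~0 in the population
print(np.corrcoef(pcw_skill[hired], ge_skill[hired])[0, 1])  # clearly negative among hires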

Erik Bernhardsson has a post with a nice visual illustration of this phenomenon (which is an instance of Berkson’s Paradox). The same principle also explains why all the handsome men you date turn out to be such jerks.

Exams

The underprivileged student should be expected to do worse. The reason is simple - the admissions test is a measure of ability but it’s not a perfect measure. Sometimes students score lower or higher than their average just by chance. When a student scores higher/lower than expected (based on demographics and whatever other information you have) it is likely that the student was lucky/unlucky in this particular test. The best estimate of the student’s ability lies somewhere between the actual test score and our prior estimate (which here is based on the demographics).

To convince yourself that it must be so, consider an example from sports. If a third league football team like Luton Town plays a top club like Real Madrid and ties, you don’t conclude that Luton Town is as good as Real Madrid. You raise your opinion of Luton Town and lower your opinion of Real Madrid but not all the way to the same level. You still expect Real Madrid to win the rematch.
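Again, a quick simulation makes the point (made-up numbers; both exams measure ability plus independent noise):

import numpy as np

np.random.seed(0)
n = 1000000
privileged = np.random.binomial(1, 0.5, n)
ability = np.random.normal(privileged * 1.0, 1.0)  # the privileged group is ahead on average
admission = ability + np.random.normal(0, 1.0, n)  # noisy admission exam
next_exam = ability + np.random.normal(0, 1.0, n)  # equally noisy second exam

same_score = np.abs(admission - 1.0) < 0.1  # students with (nearly) equal admission scores
for group in (0, 1):
    sel = same_score & (privileged == group)
    print(group, next_exam[sel].mean())  # the privileged group does better on the next exam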

This effect is an example of regression to the mean and is known as Kelley’s Paradox. This paper illustrates it with figures based on actual data from SAT and MCAT exams. You will see that the effect is not subtle!

Sex Partners

The average number of sex partners for males is the sum of the numbers of sex partners of all the males divided by the number of all the males:

$$MSP = \frac{\sum_{i \in \mathrm{males}} \mathrm{partners}_i}{\mathrm{number\ of\ males}}$$

and similarly for females:

$$FSP = \frac{\sum_{i \in \mathrm{females}} \mathrm{partners}_i}{\mathrm{number\ of\ females}}$$

The reason we think $MSP$ and $FSP$ should be equal is that every time a man and a woman become sex partners, the numerators of both $MSP$ and $FSP$ increase by $+1$. And the denominators are approximately equal too. Let’s list all the ways this tautology breaks down in real life:

  1. People lie.
  2. There are more homosexual men than homosexual women and they tend to have more partners. A homosexual relationship between men contributes $+2$ to the numerator of $MSP$ but nothing to $FSP$.
  3. Non-representative sample. If for example prostitutes are never polled or refuse to answer the survey, that could seriously lower the estimate (but not the real value) of the female average.
  4. Men and women may be using different definitions of sexual intercourse. I leave it to the reader to imagine all the situations that the male but not the female participant would describe as having had sex - without either of them technically lying. In such a situation only the numerator of $MSP$ increases. This may or may not be an issue depending on the exact phrasing of the survey.
  5. There are actually more women than men, so the denominator of $FSP$ is higher. This effect is undoubtedly real but too tiny to explain anything.

And there are other factors as well, although it’s not clear to me which way they would tend to bias the ratio:

  1. Britons may be having sex partners outside the UK - either while travelling abroad, or the sex partner may be a tourist visiting the UK. Each such partner contributes to one of $MSP$, $FSP$ but not the other.
  2. Immigration and emigration both lead to a situation where some of the sex partners of people who currently live in the UK don’t themselves (currently) live in the UK. Depending on the sex partner statistics of the people immigrating to and emigrating from the UK, this may contribute to the $MSP$, $FSP$ discrepancy.
  3. People drop out of the population by dying. This, combined with sex differences in the age at which people have sex, can result in a discrepancy between $MSP$ and $FSP$. Consider a country where every male finds 3 female sexual partners as soon as he turns 18, but those partners are exclusively women on their deathbeds. In such a country almost every adult male would have had 3 sex partners and almost every female would have had 0 (except for the tiny fraction of females who are about to die).

Conclusions

  • random sampling error produces non-random seeming results (Cancer)
  • the measurement method affects the outcome (Bar Fights)
  • nonrepresentative samples lead to spurious correlations (Competitions)
  • measurements are never 100% reliable. An accurate estimate of a quantity must combine the measurement with a prior distribution (Exams)
  • seemingly well defined concepts turn out, on closer inspection, to be slippery (Sex Partners)

Loafing Around With XGBoots

This is a guest post by Javier Rodriguez Zaurin.

My good friend Nadbor told me that he had found someone on Reddit asking whether data scientists end up doing boring tasks such as classifying shoes. As someone who has faced this problem in the past, I was committed to showing that classifying shoes is a challenging, entertaining task. Maybe the person who wrote that would find it more interesting if the objects to classify were space rockets, but whether rockets or shoes, the problem is of the same nature.

THE PROBLEM

Imagine that you work at a fashion aggregator, and every day you receive hundreds of shoes in the daily feed. The retailers send you one identifier and multiple images (with different points of view) per shoe model. Sometimes they send additional information indicating whether one of the images is the default image to be displayed on the site - normally the side view of the shoe. However, this is not always the case. Of course, you want your website to look perfect and to consistently show the same shoe perspective across the entire site. Therefore, here is the task: how do we find the side view of the shoes as they come through the feed?

THE SOLUTION

Before I jump into the technical aspect of the solution, let me just add a few lines on team-work. Through the years in both real science and data science, I have learned that cool things don’t happen in isolation. The solution that I am describing here was part of a team effort and the process was very entertaining and rewarding.

Let’s go into the details.

The solution implemented comprised two steps:

1-. Use the shape context algorithm to parameterise shoe shapes

2-. Cluster the shapes and find the clusters comprised mostly of side-view shoes

THE SHAPE CONTEXT ALGORITHM

Details on the algorithm can be found here and additional information on our Python implementation is here. There are two main steps:

1-. Find points along the silhouette of the shoe that are useful for defining the shape.

2-. Compute a Shape Context Matrix using radial and angular metrics that will effectively parameterise the shape of the shoe.

1-. FIND THE RELEVANT POINTS

Finding the relevant points to be used later to compute the Shape Context Matrix is relatively easy. If the background of the image is white, simply “slice” the image and find the initial and final non-background points in each slice. Note that due to the “convoluted” shapes of some shoes, techniques relying on contours might not work here.
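A minimal sketch of the slicing idea (my own simplification, not the actual functions mentioned below), assuming a greyscale image stored as a 2-d numpy array with a near-white background:

import numpy as np

def silhouette_points(img, n_slices=60, background=250):
    points = []
    for col in np.linspace(0, img.shape[1] - 1, n_slices).astype(int):
        rows = np.where(img[:, col] < background)[0]  # non-background pixels in this slice
        if len(rows) > 0:
            points.append((rows[0], col))   # initial non-background point
            points.append((rows[-1], col))  # final non-background point
    return np.array(points)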

I have coded a series of functions to make our lives easier. Here I show the results of using some of those functions.

The figure shows 60 points of interest found as we move along the image horizontally.

2-. SHAPE CONTEXT MATRIX

Once we have the points of interest we can compute the radial and angular metrics that will eventually lead to the Shape Context Matrix. The idea is the following: for a given point, compute the number of points that fall within a radial bin and an angular bin relative to that point.

In the first instance, we computed 2 matrices - one containing radial information and one containing angular information. For example, if we select 120 points of interest around the silhouette of the shoe, these matrices will be of dimensions (120, 120).

Once we have these matrices, the next step consists of building a shape context matrix per point of interest. Eventually, all the shape context matrices are flattened and concatenated, resulting in what is referred to as the Bin Histogram.

Let’s have a look at one of these shape context matrices. For this particular example we used 6 radial bins and 12 angular bins. Code to generate this plot can be found here:

This figure has been generated for the first point in our points-of-interest array and is interpreted as follows: if we concentrate on the upper-left “bucket” we find that, relative to the first point in our array, there are 34 other points that fall within the largest radial bin (labelled 0 in the figure) and within the first angular bin (labelled 0 in the figure). More details on the interpretation can be found here.

Once we have a matrix like the one in Figure 2 for every point of interest, we flatten and concatenate them, resulting in an array of 12 $\times$ 6 $\times$ number of points (120 in this case), i.e. 8640 values. Overall, after all this process, we end up with a numpy array of dimensions (number of images, 8640). Now we just need to cluster these arrays.
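A sketch of the whole parameterisation under those settings (6 radial bins, 12 angular bins; the exact binning here is my simplification, not necessarily the one used in the linked implementation):

import numpy as np

def shape_context(points, n_radial=6, n_angular=12):
    # points: (n_points, 2) array of silhouette points
    diff = points[None, :, :] - points[:, None, :]  # pairwise offsets
    dist = np.sqrt((diff ** 2).sum(-1))
    angle = np.arctan2(diff[..., 1], diff[..., 0])
    # log-spaced radial bins, uniform angular bins
    r_edges = np.logspace(np.log10(dist[dist > 0].min()),
                          np.log10(dist.max()), n_radial + 1)
    r_bin = np.digitize(dist, r_edges) - 1
    a_bin = ((angle + np.pi) / (2 * np.pi) * n_angular).astype(int) % n_angular
    histograms = []
    for i in range(len(points)):
        h = np.zeros((n_radial, n_angular))
        for j in range(len(points)):
            if i != j and 0 <= r_bin[i, j] < n_radial:
                h[r_bin[i, j], a_bin[i, j]] += 1
        histograms.append(h.ravel())
    return np.concatenate(histograms)  # the flattened 'bin histogram'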

RESULTS

A detailed discussion on how to pick the number of clusters and the potential caveats can be found here. In this post I will simply show the results of using MiniBatchKMeans to cluster the arrays into 15 clusters. For example, clusters 2, 3 and 10 look like this.
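In sklearn terms that is a two-liner (bin_histograms being the (number of images, 8640) array from the previous step):

from sklearn.cluster import MiniBatchKMeans

cluster_labels = MiniBatchKMeans(n_clusters=15, random_state=0).fit_predict(bin_histograms)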

Interestingly, cluster 1 is comprised of images with a non-white and/or structured background, images with a shape different from that of a shoe, and some misclassifications. Some advice on how to deal with the images in that cluster can be found here.

MOVING FORWARD

There are still a few aspects to cover to isolate the side views of the shoes with more certainty, but I will leave this for a future post (if I have the time!).

In addition, there are some other features and techniques one could try to improve the quality of the clusters, such as GIST descriptors or Haralick textural features.

Of course, if you have the budget, you can always pay someone to label the entire dataset, turn this into a supervised problem and use deep learning. A series of convolutional layers should capture shapes, colours and patterns. Nonetheless, if you think for a second about the nature of this problem, you will see that even deciding on the labelling is not a trivial task.

Anyway, for now, I will leave it here!

The code for the process described in this post can be found here.

You Won't Believe How This Islington Single Dad Is Making £500/day While Working From Home

Trigger warnings: programming humor, algorithms and data structures, Java

I’ve been interviewing data engineering contractors recently. All of the candidates are very senior people with 10+ years of experience. My go-to question:

Me: What data structure would you use (in your favorite programming language) to store a large number (let’s say 100k) of strings - so they can be looked up efficiently? And by ‘looked up’ I mean: a user will come up with a new string (‘banana’) and you have to quickly tell if this string is an element of your collection of 100k.
Candidate: I would load them in an RDD and then…
Me: No, no, I’m not asking about Spark. This is a regular single-threaded, in-memory, computer science 101 problem. What is the simplest thing that you could do?
Candidate: Grep. I would use grep to look for the string.
Me: Terrific. Sorry, maybe I wasn’t clear, I’m NOT talking about finding a substring in a larger text… You know what, forget about the strings. There are no strings. You have 100k integers. What data structure would you put them in so you can quickly look up if a new integer (1743) belongs to the collection?
Candidate: For integers I would use an array.
Me: And how do you find out if the new integer belongs to this array?
Candidate: There is a method ‘contains’.
Me: Ok. And for an array of n integers, what is the expected running time of this method in terms of n?
Candidate:
Me:
Candidate: I think it would be under one minute.
Me: Indeed.

This one was particularly funny, but otherwise unexceptional. This week I interviewed 4 people and not a single one of them mentioned hash tables. I would have also accepted ‘HashMap’, ‘Map’, ‘Set’, ‘dictionary’, ‘python curly braces’ - anything pointing in vaguely the right direction, even if they didn’t understand the implementation. Instead I only got ‘a vector, because they are thread safe’, ‘ArrayList because they are extensible’, ‘a List because lists in scala are something something’, ‘in my company we always use Sequences’. Again: these are very experienced people who are being paid a lot of money contracting for corporations in London and who can very convincingly bullshit about their Kafkas, Sparks, HBases and all the other Big Data nonsense.
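For the record, the answer I was fishing for fits in two lines of Python - the ‘python curly braces’ version:

strings = ['apple', 'banana', 'cherry']  # imagine 100k of these
lookup = set(strings)        # build once
print('banana' in lookup)    # hashed membership test, expected O(1)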

Another bizarre conversation occurred when a candidate with 16 years of experience with Java (confirmed by a Sun certificate) immediately came up with the idea of putting the strings in buckets based on their hash and started explaining to me basically how to implement a hash table in Java, complete with a discussion of the merits of different string hashing functions. When I suggested that maybe Java already has a collection type that does all of this, he reacted with indignation - he shouldn’t have to know this, you can find it out on the internet. Fair enough, but one would think that after 16 years of programming in the language someone would have encountered HashMaps once or twice… This seemed odd enough that for my next question I went off script:

Me: Can you tell me what is the signature of the main method in Java?
Candidate: What?
Me: Signature of the main method. Like, if you’re writing the ‘hello world’ program in Java, what would you have to type?
Candidate: class HelloWorld
Me: Go on.
Candidate: int main() or void main() I think
Me: And the parameters?
Candidate: Yes, I remember, there are command line parameters.
Me:
Candidate: Two parameters and the second is an integer.
Me: Thank you, I know all I wanted to know.
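(For the record: public static void main(String[] args).)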

Moral of this story?

Come to London, be a data engineering contractor and make £500/day. You can read about Java on wikipedia, put 15 years of experience on your resume and no one will be the wiser.

Python or Scala - Let the Neural Network Decide.

This is the second post about my experiments with LSTMs. Here’s the first one. This is a great introduction by Karpathy. And this is an in-depth explanation of the math behind them.

Python or Scala?

Which should you use and when? Which should you learn first? Is type safety more important than flexibility? Is Python fast enough for performance-heavy applications? Is Scala’s machine learning ecosystem mature enough for serious data science? Are indents better than braces?

This post won’t answer any of those questions.

I will show how to solve a related problem though. Given the following text, which was stitched together from bits of scikit-learn and scalaz code files, can you tell where Python ends and Scala begins?

package scalaz
package syntax

"""
Extended math utilities.
"""
# Authors: Gael Varoquaux
# Alex/** Wraps a value `selfandre Gramfort
# Alexandre T. Passos
# Olivier Grisel
# Lars Buitinck
# Stefan van der Walt
# Kyle Kastner
# Giorgio Patrini
# License:` and provides methods related to `MonadPlus` */
final class MonadPlusOps[F[_],A] private[syntax](val self: BSD 3 clause

from __future__ import division
from functools import partial
import warnings

import numpy as np
from scipy import linalg
from scipy.sparse import issparse, csr_matr F[A])(implicit val F: MonadPlus[F]) extends Ops[F[A]] {
////
impoix

from . import check_random_state
from .fixrt Leibniz.===

def filter(f: A => Boolean): F[A] =
F.filter(self)(f)

def withFilter(f: A => Boolean): F[A] =
filter(f)

final def uniteU[T](implicit T: Unapply[Foldable, Aes import np_version
from ._logistic_sigmoid import _log_logistic_sigmoid
from ..extern]): F[T.A] =
F.uniteU(self)(T)

def unite[T[_], B](implicit ev: A === T[B], T: Foldable[T]): F[B] = {
val ftb: F[T[B]] = ev.subst(seals.six.moves import xrange
from .sparsefuncs_fast import csr_row_norms
from .validation import check_array
from ..exceptions import NonBLASDotWarning


lf)
F.unite[T, B](ftb)
}
final def lefts[G[_, _], B, C](implicit ev: A === G[B, C], G: Bifoldable[G]): F[B] =
F.lefts(ev.subst(self))

final def rigdef norm(x):
"""Compute the Euclidean or Frobenius norm of x.

hts[G[_, _], B, C](implicit ev: A === G[B, C], G: Bifoldable[G]): F[C] =
F.rights(ev.subst(self))

final def separate[G[_, _], Returns the Euclidean norm when x is a vector, the Frobenius norm when x
is a matrix (2-d array). More precise than sqrt(squared_norm(x)).
"""
x = np.asarray(x)
nrm2, = lin B, C](implicit ev: A === G[B, C], G: Bifoldable[G]): (F[B], F[C]) =
F.separate(ev.subst(self))

////
}

sealed trait ToMonadPlusOps0 {
implicit def Talg.get_blas_funcs(['nrm2'], [x])
return nrm2(x)


# Newer NumPy has a ravel that needs leoMonadPlusOpsUnapply[FA](v: FA)(implicit F0: Unapply[MonadPlus, FA]) =
new MonadPlusOps[F0.M,F0.A](F0(v))ss copying.
if np_version < (1, 7, 1):
_ravel = np.ravel
else:
_ravel = partial(np.ravel, order='K')


def squared_no(F0.TC)

}

trait ToMonadPlusOps extends ToMonadPlusOps0 with ToMonadOps with ToApplicatrm(x):
"""Squared Euclidean or Frobenius norm of x.

Returns the Euclidean norm when x is a vector, the Frobenius norm when x
is a matrix (2-d array). Faster than norm(ivePlusOps {
implicit def ToMonadPlusOps[F[_],A](v: F[A])(implicit F0: MonadPlus[F]) =
new MonadPlusOps[F,A](v)

////

////
}

trait MonadPlusSyntax[F[_]] extends MonadSyntax[F] withx) ** 2.
"""
x = _ravel(x)
if np.issubdtype(x.dtype, np.integer):
ApplicativePlusSyntax[F] {
implicit def ToMonadPlusOps[A](v: F[A]): MonadPlusOps[F, A] = ne warnings.warn('Array type is integer, np.dot may overflow. '
'Data should be float type to avoid this issue',
UserWarning)
return np.dot(xw MonadPlusOps[F,A](v)(MonadPlusSyntax.this.F)

def F: MonadPlus[F]
////

////
}
package scalaz
package syntax

/** Wraps a value `self` and provides methods, x)


def row_norms(X, squared=False):
"""Row-wise (squared) Euclidean norm of X.

E related to `Traverse` */
final class Tquivalent to np.sqrt((X * X).sum(axis=1)), but also supporaverseOps[F[_],A] private[syntax](val self: F[A])(implicit val F: Traverse[F]) exterts sparse
matrices and does not create an X.shape-sized temporary.

Performs no input valnds Ops[F[A]] {
////

import Leibniz.===

I will show how Keras LSTMs and bidirectional LSTMs can be used to neatly solve this problem. The post will contain some snippets of code, but the full thing is here.

The problem

I once interviewed with a cyber security company that was scraping the web looking for people’s phone numbers, emails, credit card numbers etc. They asked me how I would go about building a model that finds those things in text files and also categorizes the files into types like ‘email’, ‘server logs’, ‘code’, etc.

The boring way

The boring answer is that with enough feature engineering you could classify files pretty well with any old ML algorithm. If all lines have a common prefix -

123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics/5star.gif HTTP/1.0" 200 1031 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /cgi-bin/newcount?jafsof3&width=4&font=digital&noshow HTTP/1.0" 200 36 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"

- then we’re probably dealing with a log file. If there’s a lot of camelCase() - that means we’re seeing code. And so on.

Finding e.g. phone numbers in text is more involved but still doable this way. You would have to first generate potential matches using regular expressions and then classify each one as true or spurious based on the context it appears in.
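Something along these lines - a deliberately naive, hypothetical pattern, with the file contents in a string called text:

import re

# candidate phone numbers; each match still has to be accepted or
# rejected by a classifier based on the context it appears in
candidates = re.findall(r'\+?\d[\d\s\-]{6,}\d', text)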

Inevitably, for every new file type and every type of entity to be found in the file, one would have to come up with new features and maybe train a separate classifier.

Super tedious.

The RNN way

The fun and potentially superior solution uses char-RNNs. Instead of all those handcrafted features and regular expressions and different models, we can train a single recurrent neural network to label each character in the text as either belonging to a phone number (credit card number, email …) or not. If we do it right and have enough training data, the network should be able to learn that phone numbers are more likely to occur in emails than in server logs and that Java code tends to use camel case while Python has indented blocks following a colon - and all kinds of other features that would otherwise have to be hardcoded.

Let’s do it!

Implementation

As it turned out, the hardest part was getting and preparing the data. Since I don’t have access to a labeled dataset with phone numbers and emails, I decided to create an artificial one. I took all the Python files from the scikit-learn repository and all the Scala files from scalaz and spliced them together into one giant sequence of characters. The sequence takes a few dozen consecutive characters from a Python file, then a few dozen from a Scala file, then Python again and so on. The result is the Frankenstein’s monster at the top of this post (except tens of megabytes more of it).

Preparing training data

The sequence made up of all the Python and Scala files wouldn’t fit in my RAM (Big Data, as promised ;), so it is generated online during training, using a generator:

from random import choice

def chars_from_files(list_of_files):
    # reads a list of files in random order and yields
    # one character at a time
    while True:
        filename = choice(list_of_files)
        with open(filename, 'rb') as f:
            chars = f.read()
            for c in chars:
                yield c

def splice_texts(files_a, files_b):
    """ Takes two lists of files and generates a sequence
    of characters from those files. Yields pairs:
    (character, index of the source - 0 or 1)
    """
    a_chars = chars_from_files(files_a)
    b_chars = chars_from_files(files_b)
    generators = [a_chars, b_chars]

    # take between 20 and 50 characters from one source
    # before moving to the other source
    jump_range = range(20, 50)

    source_ind = choice([0, 1])
    while True:
        jump_size = choice(jump_range)
        gen = generators[source_ind]
        for _ in range(jump_size):
            yield (next(gen), source_ind)
        source_ind = 1 - source_ind

# it can be used like this
gen = splice_texts(["file1.txt", "file2.txt"], ["file3.txt", "file4.txt"])
char_1, label_1 = next(gen)
char_2, label_2 = next(gen)
# and so on ...

The other reason for using a generator is that the sequence can be randomized (both the order of files and the number of consecutive characters taken from one source). This way the network will never see the same sequence twice which will reduce overfitting.

Next step is encoding the characters as vectors (one-hot-encoding):

import numpy as np

# Only allowing these characters:
chars = '\n !"#$%&\'()*+,-./0123456789:;<=>?@[\\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~'
char2ind = dict((c, i) for i, c in enumerate(chars))
char2vec = {}
for c in chars:
    vec = np.zeros(len(chars))
    vec[char2ind[c]] = 1
    char2vec[c] = vec

To take advantage of the parallel processing powers of the GPU, the input vectors need to be shaped into batches. Keras requires that batches for an LSTM be 3-dimensional arrays, where the first dimension corresponds to the number of samples in a batch, the second to the number of characters in a sequence and the third to the dimensionality of the input vector. The latter is in our case equal to the number of characters in our alphabet.

For example, if there were only two sequences to encode, both of length 4, and only 3 letters in the alphabet, this is how we would construct a batch:

# sequences to encode:
# 'abca'
# 'cacb'

# vectors corresponding to characters
a = [1,0,0]
b = [0,1,0]
c = [0,0,1]

batch = np.array([
    [a,b,c,a],
    [c,a,c,b]
])
# batch.shape gives (2, 4, 3)
# which is = (number of sequences, length of a sequence, number of available chars)

If the sequences are too long to fit in one batch - as they are in our case - they need to be split into multiple batches. This would ordinarily mean losing some context information for characters that are near the boundary of a sequence chunk. Fortunately Keras LSTM has a setting stateful=True which tells the network that the sequences from one batch are continued in the next one. For this to work, the batches must be prepared in a specific way, with n-th sequence in a batch being continued in the n-th sequence of the next batch.

# sequences to encode:
# 'abcdefgh'
# 'opqrstuv'
# (a, b, ..., v stand for the one-hot vectors of the
# corresponding characters, as in the previous example)

batch_1 = np.array([
    [a,b,c,d],      # first element of first batch
    [o,p,q,r]       # second element of first batch
])
# i-th element of second batch is the continuation of i-th element of first batch
batch_2 = np.array([
    [e,f,g,h],      # first element of second batch
    [s,t,u,v]       # second element of second batch
])

In our case, each sequence is produced by a generator reading from files. We will have to start a number of generators equal to the desired batch size.

def generate_batches(files_a, files_b, batch_size, sequence_len):
    gens = [splice_texts(files_a, files_b) for _ in range(batch_size)]
    while True:
        X = []
        y = []
        for g in gens:
            vecs = []
            labels = []
            for _ in range(sequence_len):
                c, l = next(g)
                vecs.append(char2vec[c])
                labels.append([l])
            X.append(vecs)
            y.append(labels)

        yield (np.array(X), np.array(y))

Done. This generator produces batches accepted by Keras’ LSTM. batch_size and sequence_len settings influence GPU/CPU utilisation but otherwise shouldn’t make any difference (as long as stateful=True!).

The network

Now for the easy part. Construct the network:

from keras.layers import Dense, Dropout, LSTM, TimeDistributed
from keras.models import Sequential

batch_size = 1024
seq_len = 100
n_chars = 96
rnn_size = 128
dropout_rate = 0.2
batch_shape = (batch_size, seq_len, n_chars)

model = Sequential()
# Let's use 3 LSTM layers, because why not
model.add(LSTM(rnn_size, return_sequences=True, batch_input_shape=batch_shape, stateful=True))
model.add(Dropout(dropout_rate))
model.add(LSTM(rnn_size, return_sequences=True, batch_input_shape=batch_shape, stateful=True))
model.add(Dropout(dropout_rate))
model.add(LSTM(rnn_size, return_sequences=True, batch_input_shape=batch_shape, stateful=True))
model.add(Dropout(dropout_rate))

model.add(TimeDistributed(Dense(units=1, activation='sigmoid')))
model.compile(optimizer='adam', loss='mse', metrics=['accuracy', 'binary_crossentropy'])

And train it:

from keras.callbacks import ModelCheckpoint

model_path = "models/my_model"
generator = generate_batches(files_a, files_b, batch_size, seq_len)
checkpointer = ModelCheckpoint(model_path)
model.fit_generator(generator, steps_per_epoch=1000, epochs=10, callbacks=[checkpointer])

Making predictions is just as easy:

predictions = model.predict_generator(generator, steps=50)
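predictions holds one probability per character - the probability that it comes from the second list of files. Thresholding at 0.5 gives hard labels:

char_labels = (predictions > 0.5).astype(int)  # per-character 0/1 source labels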

That’s it! The full code I used has a few more bells and whistles, but this is the core of it.

I have split the Python and Scala files into train and test sets (80:20) and trained the network on the training set for a few hours. This is what the network’s prediction on the test set (the same text as at the top of this post) looks like:

[Figure: the spliced test text from the top of the post, with the network’s predictions.]
Font size shows the true label (small - Python, big - Scala) and background color represents the network’s prediction (white - Python, dark red - Scala).

It’s pretty good overall, but the network keeps making a few unforced errors. Consider this bit:

package scalaz
package syntax

"""

  • it is very unsure about the first few characters of the input. Even though package scalaz should be a dead giveaway, the prediction only becomes confident at about the character ‘g’
  • it is sometimes too slow to change the prediction. Like in the case of Python’s triple quotation marks """ following a stretch of Scala code. Triple quotes should immediately be labeled as Python, but only the third one is.

These mistakes stem from the fact that the RNN doesn’t look ahead and can only interpret a character in the context of characters that came before. Triple quotes almost certainly come from a stretch of Python code, but you don’t know that you’re seeing triple quotes until you’ve seen all three. That’s why the prediction gradually changes from Scala to Python (red to white) as the RNN encounters the second and third consecutive quote.

This problem actually has a straightforward solution - bidirectional RNN. It’s a type of RNN where the sequence is fed to it from both ends at the same time. This way, the network will be aware of the second and third quotation marks already when it’s producing the label for the first one.

To make the LSTM bidirectional in Keras one needs simply to wrap it with the Bidirectional wrapper:

from keras.layers import Bidirectional

model.add(Bidirectional(LSTM(rnn_size, return_sequences=True, stateful=True), batch_input_shape=batch_shape))

# instead of
# model.add(LSTM(rnn_size, return_sequences=True, batch_input_shape=batch_shape))

Everything else stays the same.

Here’s a sample of results from a bidirectional LSTM:

[Figure: the bidirectional network’s predictions on another stretch of spliced scalaz and scikit-learn code.]

I think this looks better overall. The problem of updating the prediction too slowly is mostly gone - package scalaz is marked as Scala immediately, starting with the letter ‘p’. However, now the network has started making weird mistakes in the middle of a word, for no reason. Like this one:

Comparison of Calibration

Why is the middle of the ‘Calibration’ all of a sudden marked as Scala?

The culprit is statefulness. Remember that stateful=True means that for each sequence in a batch, the state of the network at the beginning of the sequence is reused from the state at the end of the previous sequence*. This acts as if there were no batches, just one unending sequence. But in a bidirectional layer the sequence is fed to the network twice, from both directions. So half of the state should be borrowed from the previous sequence and half from the next sequence - which has not been seen yet! In reality all of the state is reused from the previous sequence, so half of the network ends up in the wrong state. This is why those weird mispredictions appear, and appear at regular intervals: at the beginning of each new batch, half of the network is in the wrong state and starts out predicting the wrong label.

* or more precisely, the state at the end of the corresponding sequence in the previous batch

Let’s get rid of statefulness in the bidirectional version of the network:

model.add(Bidirectional(LSTM(rnn_size, return_sequences=True, stateful=False), batch_input_shape=batch_shape))

Unfortunately this means that we have to use longer sequences (in the previous experiments I used 128 characters, now 200) to give the network more context for labeling a character. And even then, the prediction for characters near the boundary between consecutive sequences is bound to be poorer - like in a regular unidirectional LSTM. To make up for it, I decided to give the network more layers (4) and more time to train (a day). Let’s see how it worked out:

[Figure: the deeper, non-stateful bidirectional network’s predictions on spliced scalaz and scikit-learn code.]