# Text Generation With Keras char-RNNs

I recently bought a deep learning rig to start doing all the cool stuff people do with neural networks these days. First on the list - because it seemed easiest to implement - text generation with character-based recurrent neural networks.

*Watercooling, pretty lights and 2 x GTX 1080 (on the right)*

This topic has been widely written about by better people, so if you don’t already know about char-RNNs, go read their posts instead. Here is Andrej Karpathy’s blog post that started it all. It has an introduction to RNNs plus some extremely fun examples of texts generated with them. For an in-depth explanation of LSTM (the specific type of RNN that everyone uses) I highly recommend this.

I started playing with LSTMs by copying the example from Keras, and then I kept adding to it. First more layers; then training with generators instead of in-memory batches, to handle datasets that don’t fit in memory; then a bunch of scripts for getting interesting datasets; then utilities for persisting the models; and so on. I ended up with a small set of command line tools for getting the data and running the experiments that I thought might be worth sharing. Here it is.
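The generator idea is simple: instead of one-hot encoding the whole corpus up front, yield batches of encoded character windows on the fly. Here is a minimal sketch of what such a generator can look like; the names (`batch_generator`, `seq_len`) are illustrative, not taken from the repo:

```python
import numpy as np

def batch_generator(text, chars, seq_len=40, batch_size=64):
    """Yield (X, y) batches of one-hot encoded character windows forever,
    so the full corpus never has to be encoded in memory at once."""
    char_to_idx = {c: i for i, c in enumerate(chars)}
    vocab = len(chars)
    while True:  # Keras expects training generators to loop indefinitely
        X = np.zeros((batch_size, seq_len, vocab), dtype=np.bool_)
        y = np.zeros((batch_size, vocab), dtype=np.bool_)
        for b in range(batch_size):
            # Pick a random window and predict the character that follows it.
            start = np.random.randint(0, len(text) - seq_len - 1)
            for t, ch in enumerate(text[start:start + seq_len]):
                X[b, t, char_to_idx[ch]] = 1
            y[b, char_to_idx[text[start + seq_len]]] = 1
        yield X, y
```

A generator like this can then be handed to the model’s generator-based fit method in place of a fixed array of training data.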

### And here are the results

A network with 3 LSTM layers of 512 units each plus a dense layer, trained for a week on the concatenation of all Java files from the Hadoop repository, produces stuff like this:
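For the curious, the architecture described above can be sketched roughly like this in Keras (layer sizes match the post; everything else, including the optimizer and the `build_model` helper, is my assumption, not the repo’s exact code):

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

def build_model(seq_len, vocab_size):
    """3 stacked LSTM layers of 512 units plus a dense softmax
    over the character vocabulary."""
    model = Sequential()
    # The first two LSTMs return full sequences so the next LSTM
    # receives one vector per timestep.
    model.add(LSTM(512, return_sequences=True,
                   input_shape=(seq_len, vocab_size)))
    model.add(LSTM(512, return_sequences=True))
    model.add(LSTM(512))  # last layer keeps only the final state
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model
```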

That’s pretty believable Java if you don’t look too closely! It’s important to remember that this is a character-level model. It doesn’t just assemble previously encountered tokens together in some new order. It hallucinates everything from the ground up. For example setSchedulerAppTestsBufferWithClusterMasterReconfiguration() is sadly not a real function in the Hadoop codebase. Although it very well could be, and it wouldn’t stand out among all the other monstrous names like RefreshAuthorazationPolicyProtocolServerSideTranslatorPB. Which was exactly the point of this exercise.
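Hallucinating character by character works by repeatedly sampling the next character from the network’s softmax output. A sketch of the usual temperature-sampling helper (this mirrors the one in the Keras char-RNN example; the exact code in my repo may differ):

```python
import numpy as np

def sample(preds, temperature=1.0):
    """Draw an index from a softmax distribution, rescaled by temperature.
    Low temperature -> conservative, repetitive text;
    high temperature -> more surprising (and more broken) text."""
    preds = np.log(np.asarray(preds, dtype=np.float64)) / temperature
    probs = np.exp(preds) / np.sum(np.exp(preds))
    return int(np.random.choice(len(probs), p=probs))
```

At temperature 1.0 this samples straight from the model’s distribution; as the temperature approaches 0 it degenerates into always picking the most likely character.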

Sometimes the network decides that it’s time for a new file and then it produces the Apache software license verbatim, followed by a million lines of imports:

At first I thought this was a glitch, a result of the network getting stuck in one state. But then I checked and - no, actually this is exactly what those Hadoop .java files look like; 50 lines is a completely unexceptional amount of imports. And again, most of those imports are made up.

And here’s a bunch of Python code dreamt up by a network trained on the scikit-learn repository.

This is of much lower quality, because the network was smaller and sklearn’s codebase is much smaller than Hadoop’s. I’m sure there is a witty comment about the quality of code in those two repositories somewhere in there.

And here’s the result of training on the scalaz repository:

In equal measure elegant and incomprehensible. Just like real scalaz.

Enough with the github. How about we try some literature? Here’s LSTM-generated Jane Austen:

“I wish I had not been satisfied with the other thing.”

“Do you think you have not the party in the world who has been a great deal of more agreeable young ladies to be got on to Miss Tilney’s happiness to Miss Tilney. They were only to all all the rest of the same day. She was gone away in her mother’s drive, but she had not passed the rest. They were to be already ready to be a good deal of service the appearance of the house, was discouraged to be a great deal to say, “I do not think I have been able to do them in Bath?”

“Yes, very often to have a most complaint, and what shall I be to her a great deal of more advantage in the garden, and I am sure you have come at all the proper and the entire of his side for the conversation of Mr. Tilney, and he was satisfied with the door, was sure her situation was soon getting to be a decided partner to her and her father’s circumstances. They were offended to her, and they were all the expenses of the books, and was now perfectly much at Louisa the sense of the family of the compliments in the room. The arrival was to be more good.

That was ok, but let’s try something more modern. And what better represents modernity than people believing that the earth is flat? I scraped all the top level comments from the top 500 YouTube videos matching the query “flat earth”. Here is the comments scraper I made for this. And here is what the neural network spat out after ingesting 10MB worth of those comments:

That doesn’t make any sense at all. It’s so much like real YouTube comments it’s uncanny.