Updated 12 SEPT 2017: Improved the training procedure and rewrote parts of post
Since ancient times, it has been known that machines excel at math while humans are pretty good at detecting cats in pictures. But with the advent of deep learning, the boundaries have started to blur…
Today we want to teach the machines to do math - again! But instead of feeding them with optimized, known representations of numbers and calling hard-wired operations on them, we will feed in strings representing math formulas along with strings representing their results, character by character, and have the machine figure out how to interpret them and arrive at the results on its own.
The resulting model is much more human: Not all results are exact, but they are close. It’s more of an approximation than an exact calculation. As a 100% human myself, I surely can relate to almost-correct math, if anything.
Today’s post will be much more practical than usual, so if you want to build this at home, get your long-neglected Pythons out of their cages! You can find the complete code with lots of improvements here.
Recurrent Neural Networks and Long-Short-Term-Memory
As in my previous post, we’re going to use Recurrent Neural Networks, or RNN for short. More specifically, a kind of RNN known under the fancy name of Long-Short-Term-Memory, or LSTM. If you need a primer on these, I can recommend Christopher Olah’s “Understanding LSTMs” and again Andrej Karpathy’s “The Unreasonable Effectiveness of Recurrent Neural Networks”. But to summarize: RNNs are essentially building blocks of Neural Networks that not only look at the current input, but also remember the input before. LSTMs especially have a complex memory mechanism that can learn which parts of the data are important to remember, which can be ignored, and which can be forgotten.
Sequence to Sequence Learning
Sequence to sequence learning deals with problems in which a source sequence of inputs has to be mapped to a target sequence of outputs where each output is not necessarily directly dependent on a single input. The classical example is translation. How do you learn that a Chinese input phrase “他现在已经在路上了。” equals “She is on her way.” in English?
Even with RNNs it’s not directly obvious how to do this. Well, it wasn’t until Google published their paper “Sequence to Sequence Learning with Neural Networks” (Seq2Seq) in 2014. The idea is to train a joint encoder-decoder model. First, an encoder based on RNNs learns an abstract representation. Then a decoder also based on RNNs learns to decode it to another language, generating a new sequence from the encoding as output.
For most problems, data is hard to come by. Labeled data even more so. But math
equations are cheap: They can be easily generated, and Python gives us their
result simply by calling
Starting with simple addition of small, natural numbers, we can easily generate
a lot of equations along with their results and train a Seq2Seq model on it.
For example, a typical datapoint for addition could look like this:
input = '31 + 87' output = '118'
Since we’re learning on just a fraction of all possible formulas, the model can’t just learn the results by heart. Instead, in order to generalize to all other equations, it really needs to “understand” what addition “means”. This could include, among others:
- An equation consists of numbers and operations
- Two numbers are added up digit-by-digit
- If two digits add up to a value higher than 9, they carry over to the next
a + b = b + a
- Adding zero does not actually do anything:
a + 0 = a
And this just for additions! If we add more operations, or mix them, the network needs to grog even more rules like this.
Setting up your Environment
If you’re doing this from scratch, you will want to work in a new virtualenv.
I also recommend using Python 3, because it’s really time we got over Python
2… If you don’t know how virtualenv works, I really recommend looking into
them. You can find a short tutorial
here. To get a
virtualenv with Python 3, try
virtualenv -p python3 venv.
Once your virtualenv is activated, install these requirements:
pip install numpy tensorflow keras
If you run into trouble with Tensorflow, follow their guide.
Great! You can paste the code below into a Python shell, or store it in a file, it’s up to you.
Generating Math Formulas as Strings
For now, let’s build some equations! We’re going to go real simple here and just work with easy-peasy addition of two small natural numbers. Once we get that to work, we can build more funky stuff later on. See the full code to find out how to build more complex equations.
To make sure we build each possible equation only once, we’re going to use a
generator. With the handy
itertools Python standard package, we can generate
all possible permutations of two numbers and then create a formula for each
pair. Here’s our basic code. Note that I’m referencing some global variables
(IN ALL CAPS), we will define them later.
Assuming global config values
MIN_NUMBER = 0 and
MAX_NUMBER = 999,
this code will generate us equations like these:
'89 + 7' '316 + 189' '240 + 935'
For any of these equation strings, we can easily get the result in Python using
That was easy, right?
Encoding Strings for Neural Networks
Remember we want to look at the strings as sequences. The RNN will not see the input string as a whole, but one character after the other. It will then have to learn to remember the important parts in its internal state. That means we need to convert each input and output string into vectors first. Let’s also add an “end” character to each sequence so the neural network can learn when a sequence is finished. I’m using the (invisible) ‘\0’ character here.
We do this by encoding each character as a one-hot vector. Each string is then simply a matrix of these character-vectors. The way we encode is quite arbitrary, as long as we decode it the same way later on. Here’s some code I used for this:
With these helper functions, it is quite easy to generate the dataset:
And later to print out some examples along with their targets:
It is important to note here that the training data uses a max length for the input and output sequences only for numerical reasons. The model is not limited by that length and could look at or output longer sequences after this training.
Building the Model in Keras
Now, let’s build the model. Thanks to Keras, this is quite straightforward.
First, we need to decide what the shape of our inputs is supposed to be. Since
it’s a matrix with a one-hot-vector for each position in the equation string,
this is simply
(MAX_EQUATION_LENGTH, N_FEATURES). We’ll pass that input to a
first layer consisting of 256 LSTM cells. Each will look at the input,
character by character, and output a single value. We also add some light
dropout for regularization.
Now, ideally we’d use the activation of each LSTM for each of the sequence
items and then use them as inputs to our decoder. It seems Keras is having
a hard time with this, so the accepted workaround is to let the LSTM cells
output single values each, and then repeat these vectors using
to feed the decoder network. There’s an issue on Keras discussing this
So after we repeat the encoded vector
n times with
n being the (maximum)
length of our output sequences, we run this repeat-sequence through the
decoder: An LSTM layer that will output sequences of vectors. Finally, we want
to combine each LSTM cell’s output at each point in the sequence to a single
output vector. This is done using Keras’
TimeDistributed wrapper around a
Since we expect something like a one-hot vector for each output character,
we still need to apply
softmax as usual in classification problems. This
essentially yields us a probability distribution over the character classes.
Note that from what I gathered, this way of building Seq2Seq models in Keras is not optimal and not exactly equivalent to what is proposed in the paper. It works nonetheless. For a more correct implementation, try out Fariz Rahman’s seq2seq package.
Either way, here is our code:
Training is easy. We just call the fit function of the model, passing handy
callbacks like the
ModelCheckpoint, which stores the best model after each
epoch. Here’s the main function of our code:
Finally, we need to fill in some of the global variables we used throughout the code. With equations using two numbers from 0 to 999, there’s 998k possible data points. Let’s use about a sixteenth of that for our data, and a quarter of that for testing, meaning we’ll train on ~ 50k data points and validate on ~ 14k different data points. Here’s my config:
Finally, we can start training. Either call
main() from the shell, or store
everything in a file
training.py and run it via
Running the code as it is described here, with some patience I get to a test
0.999 on the test set after about 60 epochs:
That’s great, but still not perfect! Did our model not generalize to all areas of the problem space?
Analyzing the Mistakes
Let me preface this by saying that the problems in which Deep Learning really shines are all concerned with “natural” data, like speech or images. In these fields, a bit of noise can be perfectly acceptable, the same way that humans do not achieve a perfect performance in anything (sorry). This is math though, and it would be a nice proof of the power of LSTM models if we could learn something that requires 100% exactness.
So, let’s browse some of the examples where the model failed to give a correct
answer. I see some interesting ones where we get correct representations but
small errors in the space of numbers, e.g.
800 + 890 = 1689 (expected: 1690).
It seems like the network did not follow the strategy of adding up number by
number as I (a biased human) expected, but rather goes for a more
“instinctive” adding up.
Even more interesting are the failures where things seem to have gone
completely off the hook:
978 + 20 = 008 (expected: 998). What happened here?
It seems that part of the model thought the result should be
another part thought it should be
998. It’s almost as if each position in
the output sequence follows its own stubborn logic, not really caring if the
number as a whole makes any sense. This is probably a weakness of LSTMs with
these kinds of data and would require some research to solve.
Anyway, since our input space has only two factors of variation (the two numbers that go into building the equation), we can plot the equation space in a 2D pane. So I went ahead and created a scatter plot with green dots marking correct predictions and red dots marking incorrect ones. Find the code for that here.
It seems errors appear mostly in the lower bounds, where one of the numbers is below 100 or even below 10. Here, the rules of adding up numbers change a little bit (the first digit no longer counts the 100s). To increase the trouble, this problematic space has less training examples than the easier parts.
We might solve this problem by introducing sample weights or reversing the strings (so that the order of digits in numbers is more deterministic). I’ll leave it up to you to work on this!
There’s lots to do! You can get creative about getting over this lower-number
problem. We can also increase the complexity of the equations. I was able to get
quite far on ones as complex as
131 + 83 - 744 * 33 (= -24338), but haven’t
really gotten it to work with division. But that’s probably just a matter of
tweaking the model or the training process.
Feel free to pass on hints, ideas for improvement, or your own results in the comments or as issues on my repository.