I recently compared the performance of various RNN cells in a word prediction application. While doing this, I noticed that the Gated Recurrent Unit (GRU) ran slower per epoch than the LSTM cell. Based on the computation graphs of the two cells, I expected the GRU to be a bit faster (as the literature also suggests). So I launched an investigation into runtimes, covering building TensorFlow from source, a hand-built GRU, laptop vs. Amazon EC2 GPU, and feed_dict vs. QueueRunner.
Here is a brief summary of the test case I used for all the below benchmarks:
- embedding layer width = 64
- rnn width = 192
- rnn sequence length = 20
- hidden layer width = 96 (fed by final RNN state)
- learning rate = 0.05, momentum = 0.8 (SGD with momentum)
- batch size = 32
- epochs = 2 (~250k examples per epoch)
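To make the setup concrete, here is a rough sketch of a model with these settings in TF 1.x style. The vocabulary size, variable names, and the loss/optimizer wiring are illustrative assumptions, not the original benchmark code.

```python
import tensorflow as tf

# Sizes from the benchmark description; VOCAB_SIZE is an assumption.
VOCAB_SIZE = 10000
EMBED_WIDTH = 64
RNN_WIDTH = 192
SEQ_LEN = 20
HIDDEN_WIDTH = 96
BATCH_SIZE = 32

tokens = tf.placeholder(tf.int32, [BATCH_SIZE, SEQ_LEN])
targets = tf.placeholder(tf.int32, [BATCH_SIZE])

# Embedding layer
embedding = tf.get_variable('embedding', [VOCAB_SIZE, EMBED_WIDTH])
inputs = tf.nn.embedding_lookup(embedding, tokens)    # [batch, seq, embed]

# RNN layer (GRU); the final state feeds the hidden layer
cell = tf.contrib.rnn.GRUCell(RNN_WIDTH)
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

# Hidden layer and word-prediction logits
hidden = tf.layers.dense(final_state, HIDDEN_WIDTH, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, VOCAB_SIZE)

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits))
train_op = tf.train.MomentumOptimizer(0.05, 0.8).minimize(loss)
```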
Build TensorFlow from Source
I suspect many people ignore the warnings TensorFlow prints at startup about CPU instructions (SSE, AVX, FMA) that are available on the machine but not enabled in the prebuilt binaries.
The TensorFlow Performance Guide recommends building from source, but gives no benchmarks on the speed improvement. I rarely go to the effort of building from source, and I had never tried it on a piece of software as complex as TensorFlow. I decided to give it a shot, and honestly, it wasn't very difficult.
The speed improvement was 1.8x on a CPU (noted as ‘laptop’ in the table):
| pip (laptop) | source (laptop) | pip (GPU) | source (GPU) |
Table 1. Comparison of run times. laptop = Intel i5 w/ Linux. GPU = AWS EC2 p2.xlarge instance (1 x NVIDIA K80). pip = TensorFlow installed with pip. source = TensorFlow built from source for the native CPU.
The speed improvement on a GPU machine was negligible. The CUDA and cuDNN libraries from NVIDIA are already compiled and optimized for their GPU.
Hand-Built vs. GRUCell + dynamic_rnn
As you probably noted above, the hand-built GRU provided an additional performance boost on top of compiling for your native CPU.
| CPU | AWS EC2 GPU |
Table 2. 1.4x improvement by hand-crafting a GRU-based RNN vs. using tf.contrib.rnn.GRUCell with tf.nn.dynamic_rnn.
Here is the code for a hand-built GRU:
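A minimal sketch of such a cell, unrolled with a plain Python loop (TF 1.x style). The input is assumed to be a [batch, seq_len, depth] tensor, and the variable names are illustrative.

```python
def gru_layer(inputs, num_units):
    """Hand-built GRU over inputs of shape [batch, seq_len, depth]; returns the final state."""
    batch_size = tf.shape(inputs)[0]
    seq_len = inputs.get_shape()[1].value
    depth = inputs.get_shape()[2].value

    # Reset (r) and update (z) gates share one matmul; the candidate state gets its own.
    W_gates = tf.get_variable('W_gates', [depth + num_units, 2 * num_units])
    b_gates = tf.get_variable('b_gates', [2 * num_units],
                              initializer=tf.constant_initializer(1.0))
    W_cand = tf.get_variable('W_cand', [depth + num_units, num_units])
    b_cand = tf.get_variable('b_cand', [num_units],
                             initializer=tf.zeros_initializer())

    h = tf.zeros([batch_size, num_units])
    for t in range(seq_len):
        x_t = inputs[:, t, :]
        gates = tf.sigmoid(tf.matmul(tf.concat([x_t, h], 1), W_gates) + b_gates)
        r, z = tf.split(gates, 2, axis=1)
        h_tilde = tf.tanh(tf.matmul(tf.concat([x_t, r * h], 1), W_cand) + b_cand)
        h = z * h + (1.0 - z) * h_tilde   # same mixing convention as TF's GRUCell
    return h
```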
Here is the same implementation using tf.contrib.rnn.GRUCell and tf.nn.dynamic_rnn.
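As a minimal sketch, again assuming inputs is the [batch, seq_len, depth] tensor from above:

```python
# Off-the-shelf GRU cell driven by dynamic_rnn
cell = tf.contrib.rnn.GRUCell(num_units=192)
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
```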
Once you’ve built your GRU cell, it is three lines of code to instantiate your custom RNN vs. two lines using tf.nn.dynamic_rnn. Clearly, the off-the-shelf RNN cells are easier to configure and less prone to error. TensorFlow is all about reducing “cognitive load” so you can focus on designing your networks, but a 1.4x performance hit is a steep price to pay.
What’s the Problem?
I didn’t see any issues with the tf.contrib.rnn.GRUCell implementation itself, so I suspected the speed hit was coming from tf.nn.dynamic_rnn. I ran a quick test comparing the speed of the same GRU cell driven by dynamic_rnn vs. static_rnn:
So, using dynamic_rnn accounted for most of the speed difference.
Here is the modified code to use static_rnn.
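A minimal sketch of that change: static_rnn wants a Python list of per-timestep tensors rather than a single 3-D tensor, so the input gets unstacked first. (Depending on your TensorFlow version, static_rnn lives under tf.nn or tf.contrib.rnn.)

```python
cell = tf.contrib.rnn.GRUCell(num_units=192)
# Unstack [batch, seq_len, depth] into seq_len tensors of shape [batch, depth].
inputs_list = tf.unstack(inputs, axis=1)
outputs, final_state = tf.nn.static_rnn(cell, inputs_list, dtype=tf.float32)
```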
feed_dict vs. QueueRunner
I did a quick check of the speed improvement from using a QueueRunner versus a feed_dict. I only ran this experiment on the AWS EC2 GPU instance. The performance improvement was very slight:
It was easy to store all the data on the GPU for this benchmark, so it isn’t surprising that the performance is about the same. I was never able to get GPU utilization over 60% in any scenario, even after trying several different queue architectures. Training a convolutional net on image data would surely be a different story.
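For reference, the queue-based setup I mean is roughly the following sketch, assuming the whole dataset fits in two in-graph tensors built from NumPy arrays (train_x and train_y are illustrative names):

```python
features = tf.constant(train_x)   # whole dataset held as in-graph tensors
labels = tf.constant(train_y)

# Produce shuffled single examples, then batch them; QueueRunners do the work behind the scenes.
x, y = tf.train.slice_input_producer([features, labels], shuffle=True)
x_batch, y_batch = tf.train.batch([x, y], batch_size=32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # ... training loop reads x_batch / y_batch directly, no feed_dict ...
    coord.request_stop()
    coord.join(threads)
```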
Laptop vs. Amazon EC2 GPU
This was quite disappointing, especially since I had been paying Amazon for GPU compute time. The p2.xlarge instance had been faster than my laptop, but only because I hadn’t compiled TensorFlow for my laptop’s native CPU microarchitecture. With a native build, my laptop beats the GPU instance (Table 2).
Again, I’m sure the story would be entirely different with a deep convolutional net (CNN) trained on image data. I’ll be turning my attention to deep CNNs soon.
I think the takeaways are simple:
- Build TensorFlow for your machine
- If you don’t need the features of dynamic_rnn, then use static_rnn for a speed boost.
As always, I hope you found this post helpful. Please comment with questions, corrections or ideas for future posts.