TensorFlow Speed: Build from Source and Hand-Built GRU
I recently compared the performance of various RNN cells in a word-prediction application. While doing this, I noticed that the Gated Recurrent Unit (GRU) ran slower per epoch than the LSTM cell. Based on the computation graphs for the two cells, I expected the GRU to be a bit faster (the literature also reports this). So I launched an investigation into runtimes that covered building TensorFlow from source, hand-building a GRU, my laptop vs. an Amazon EC2 GPU instance, and feed_dict vs. QueueRunner.
Here is a brief summary of the test case I used for all of the benchmarks below:
- embedding layer width = 64
- rnn width = 192
- rnn sequence length = 20
- hidden layer width = 96 (fed by the final RNN state)
- learning rate = 0.05, momentum = 0.8 (SGD with momentum)
- batch size = 32
- epochs = 2 (~250k examples per epoch)
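For reference, here is a minimal sketch of a configuration object that is consistent with these settings and with the config.* attributes used in the code below. Only rnn_size, num_rnn_steps, and batch_size appear in the original snippets; the other attribute names are my own placeholders.

class Config(object):
    """Benchmark hyperparameters (values from the list above)."""
    embed_size = 64        # embedding layer width (placeholder name)
    rnn_size = 192         # RNN width, i.e. number of GRU units
    num_rnn_steps = 20     # RNN sequence length
    hidden_size = 96       # hidden layer width, fed by the final RNN state (placeholder name)
    learning_rate = 0.05   # SGD with momentum
    momentum = 0.8
    batch_size = 32
    num_epochs = 2         # ~250k examples per epoch

config = Config()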
Build TensorFlow from Source
I suspect many people ignore these warnings at TensorFlow startup:
The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow Performance Guide recommends building from source, but gives no benchmarks on the speed improvement. I rarely go through the effort of building from source, and had never done so for a piece of software as complex as TensorFlow. I decided to give it a shot, and honestly, it wasn’t very difficult.
The speed improvement on the CPU (noted as ‘laptop’ in the table) was up to 1.8x:
| | pip (laptop) | source (laptop) | pip (GPU) | source (GPU) |
|---|---|---|---|---|
| GRUCell | 390s | 258s | 240s | 228s |
| Hand-Built GRU | 339s | 187s | 210s | 209s |

Table 1. Comparison of run times. laptop = Intel i5 w/ Linux. GPU = AWS EC2 p2.xlarge instance (1 x NVIDIA K80). pip = TensorFlow installed with pip. source = TensorFlow built from source using bazel.
The speed improvement on the GPU machine was negligible: NVIDIA’s CUDA and cuDNN libraries are already compiled and optimized for the GPU, so building TensorFlow from source mainly helps the parts of the computation that stay on the CPU.
Hand-Built vs. GRUCell + dynamic_rnn
As you probably noticed above, the hand-built GRU provided an additional performance boost on top of compiling TensorFlow for the native CPU.
| | CPU | AWS EC2 GPU |
|---|---|---|
| GRUCell | 258s | 228s |
| Hand-Built GRU | 187s | 209s |

Table 2. 1.4x improvement from hand-crafting a GRU-based RNN vs. using tf.contrib.rnn.GRUCell and tf.nn.dynamic_rnn.
Here is the code for a hand-built GRU:
def init_rnn_cell(x, num_cells, batch_size):
    """Initialize GRU variables and the initial hidden state."""
    i_sz = x.shape[1] + num_cells   # weights act on [x, h] concatenated: embed width + rnn width
    o_sz = num_cells
    with tf.variable_scope('GRU'):
        # vsi_initializer, one_initializer and zero_initializer are defined
        # elsewhere in the project (not shown here).
        Wr = tf.get_variable('Wr', (i_sz, o_sz), tf.float32, vsi_initializer)
        Wz = tf.get_variable('Wz', (i_sz, o_sz), tf.float32, vsi_initializer)
        W = tf.get_variable('W', (i_sz, o_sz), tf.float32, vsi_initializer)
        br = tf.get_variable('br', o_sz, tf.float32, one_initializer)
        bz = tf.get_variable('bz', o_sz, tf.float32, one_initializer)
        b = tf.get_variable('b', o_sz, tf.float32, zero_initializer)
        h_init = tf.get_variable('h_init', (batch_size, o_sz), tf.float32, zero_initializer)
    return h_init
def cell(x, h_1):
    """GRU, eqns. from: http://arxiv.org/abs/1406.1078"""
    with tf.variable_scope('GRU', reuse=True):
        Wr = tf.get_variable('Wr')
        Wz = tf.get_variable('Wz')
        W = tf.get_variable('W')
        br = tf.get_variable('br')
        bz = tf.get_variable('bz')
        b = tf.get_variable('b')
    xh = tf.concat([x, h_1], axis=1)
    r = tf.sigmoid(tf.matmul(xh, Wr) + br)      # Eq. 5: reset gate
    rh_1 = r * h_1
    xrh_1 = tf.concat([x, rh_1], axis=1)
    z = tf.sigmoid(tf.matmul(xh, Wz) + bz)      # Eq. 6: update gate
    h_tild = tf.tanh(tf.matmul(xrh_1, W) + b)   # Eq. 8: candidate state
    h = z*h_1 + (1-z)*h_tild                    # Eq. 7: new hidden state
    return h
# 20-step RNN (config.num_rnn_steps == 20)
s = [init_rnn_cell(embed_out[:, 0, :], config.rnn_size, config.batch_size)]
for i in range(config.num_rnn_steps):
    s.append(cell(embed_out[:, i, :], s[-1]))
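The final state, s[-1], is what feeds the 96-unit hidden layer from the test setup. Here is a minimal sketch of that connection; the tf.layers.dense call, the activation, and the hidden_out name are my own, not taken from the benchmark code:

# Hypothetical: feed the final GRU state into the 96-unit hidden layer.
hidden_out = tf.layers.dense(s[-1], config.hidden_size, activation=tf.nn.relu)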
Here is the same implementation using tf.contrib.rnn.GRUCell and tf.nn.dynamic_rnn:
# 20-step RNN (config.num_rnn_steps == 20)
rnn_cell = tf.contrib.rnn.GRUCell(config.rnn_size, activation=tf.tanh)
rnn_out, state = tf.nn.dynamic_rnn(rnn_cell, embed_out, dtype=tf.float32)
Once you’ve built your GRU cell, it is three lines of code to instantiate your custom RNN vs. two lines using tf.contrib.rnn.GRUCell and tf.nn.dynamic_rnn. Clearly, the off-the-shelf RNN cells are easily configured and less prone to error. TensorFlow is all about reducing “cognitive load” so you can focus on designing your networks. But a 1.4x performance hit is a steep price to pay.
What’s the Problem: tf.contrib.rnn.GRUCell or tf.nn.dynamic_rnn?
I didn’t see any issues with the tf.contrib.rnn.GRUCell implementation, so I suspected the speed hit was coming from tf.nn.dynamic_rnn. I ran a quick test comparing the run time of tf.nn.dynamic_rnn to that of tf.contrib.rnn.static_rnn:
- dynamic_rnn = 258s
- static_rnn = 202s
So, dynamic_rnn accounted for most of the speed difference.
Here is the modified code to use static_rnn:
# 20-step RNN (config.num_rnn_steps == 20)
rnn_cell = tf.contrib.rnn.GRUCell(config.rnn_size, activation=tf.tanh)
inputs = [embed_out[:, i, :] for i in range(config.num_rnn_steps)]
rnn_out, state = tf.contrib.rnn.static_rnn(rnn_cell, inputs, dtype=tf.float32)
feed_dict vs. QueueRunner
I did a quick check of the speed improvement from using a QueueRunner versus a feed_dict. I only ran this experiment on the AWS EC2 GPU instance. The performance improvement was very slight:
- QueueRunner = 222s
- feed_dict = 228s
It was easy to store all of the data on the GPU for this benchmark, so it isn’t surprising that the performance is about the same. I was never able to get GPU utilization over 60% in any scenario, even after trying several different queue architectures. Training a convolutional net on image data would surely be a different story.
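For context, here is a minimal sketch of the kind of QueueRunner pipeline I mean. The data, queue capacity, shapes, and tensor names are illustrative stand-ins, not the actual benchmark code:

import numpy as np
import tensorflow as tf

# Illustrative stand-ins for the pre-loaded word-id data (assumptions, not the real dataset).
all_inputs = tf.constant(np.random.randint(0, 10000, size=(1000, 20)), dtype=tf.int32)
all_targets = tf.constant(np.random.randint(0, 10000, size=(1000,)), dtype=tf.int32)

# FIFOQueue holding (input sequence, target word) pairs.
queue = tf.FIFOQueue(capacity=5000, dtypes=[tf.int32, tf.int32], shapes=[[20], []])
enqueue_op = queue.enqueue_many([all_inputs, all_targets])
input_batch, target_batch = queue.dequeue_many(32)

# A QueueRunner refills the queue from background threads.
tf.train.add_queue_runner(tf.train.QueueRunner(queue, [enqueue_op]))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run(input_batch).shape)  # (32, 20): one batch pulled from the queue
    coord.request_stop()
    coord.join(threads)

In the real model, input_batch and target_batch would take the place of the tensors that the feed_dict version feeds at each training step.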
Laptop vs. Amazon EC2 GPU
This was quite disappointing, especially since I had been paying Amazon for GPU compute time. The p2.xlarge instance was faster than my laptop, but only because I hadn’t compiled TensorFlow for my CPU’s native microarchitecture. With a proper build, my laptop now beats the GPU instance (Table 2).
Again, I’m sure the story would be entirely different with a deep convolutional net (CNN) trained on image data. I’ll be turning my attention to deep CNNs soon.
Summary
I think the takeaways are simple:
- Build TensorFlow from source for your machine.
- If you don’t need the features of dynamic_rnn, then use static_rnn for a speed boost.
As always, I hope you found this post helpful. Please comment with questions, corrections or ideas for future posts.