Torch vs TensorFlow vs Theano


For an ongoing project at CCRi, we wanted to determine whether it made the most sense to stay with Torch (used for Phase I of the project, running on GPUs) or to switch to TensorFlow or Theano for Phase II. We ultimately found that TensorFlow’s combination of performance and usability made it the best choice as we move into Phase II.

As always with such tests, newer versions of any components used can make these results increasingly dated, but it was interesting to compare the current state of the art of the three frameworks.

All benchmarks were run on boxes using a single Pascal Titan X graphics card with CUDA 8 and version 5.1 of cuDNN, NVIDIA’s CUDA Deep Neural Network library.

The GitHub page Setting up a Deep Learning Machine from Scratch (Software) has good background on installing many of the tools described here.

 

Modeling Metrics

Half-precision floating point (fp16) support

Torch

  • Torch’s cunn library (a standard CUDA neural network backend of Torch) recently completed its support for fp16 computation.
  • Problem: this support doesn’t cover Recurrent Neural Networks (RNNs) or even basic neural net layers such as Linear layers.

TensorFlow

  • Has fp16 storage support, but not fp16 computation at the moment. TensorFlow has a GitHub ticket for this, but Google has been largely silent on it.

Theano

Note on fp16 on Pascal Titan X

According to the AnandTech article The NVIDIA GeForce GTX 1080 & GTX 1070 Founders Editions Review: Kicking Off the FinFET Generation, it is unclear how much fp16 support actually helps on the Pascal Titan X architecture. To evaluate this, we ran a test on one of the boxes described above, using Torch7 to measure both memory usage and speed, with speed measured in samples / sec.

 

Model: VGG-16 with fully connected layers removed.
GPU: 1 Pascal Titan X
Drivers: CUDA 8 with cuDNN 5.1

 

Summary: fp16 uses less RAM, but is slower per sample. The only reason I can think of to use fp16 on a Pascal Titan X is if the model with a single batch would otherwise be too big to fit in RAM.
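To make the storage side of that tradeoff concrete, here is a minimal sketch using only Python's standard library. It is not the Torch code we benchmarked and it does not touch the GPU; it only shows that a half-precision value takes half the bytes of a single-precision value and loses some precision in exchange. The weight value is made up for illustration.

```python
import struct

def roundtrip(value, fmt):
    """Pack a Python float into the given struct format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

# 'e' is IEEE 754 half precision (fp16); 'f' is single precision (fp32).
FP16_BYTES = struct.calcsize("<e")
FP32_BYTES = struct.calcsize("<f")

weight = 0.1234567  # hypothetical weight value, for illustration only
as_fp16 = roundtrip(weight, "<e")
as_fp32 = roundtrip(weight, "<f")

print("bytes per value: fp16 =", FP16_BYTES, " fp32 =", FP32_BYTES)
print("fp16 round-trip error:", abs(as_fp16 - weight))
print("fp32 round-trip error:", abs(as_fp32 - weight))
```

Halving the bytes per value is what lets a larger batch fit in RAM; whether the arithmetic is also faster depends on the hardware.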

 

precision  batch size  forward samples / sec  forward + backward samples / sec  max memory usage (MB)
fp32       32          191                    55                                4269
fp16       32          149                    38                                2373
fp32       64          194                    55                                7981
fp16       64          151                    38                                4239
fp32       128         out of memory          out of memory                     out of memory
fp16       128         150                    36                                7971
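The tradeoff in the table above is easier to see as ratios. This short sketch copies the batch size 32 and 64 rows (the only ones where both precisions ran) and divides the fp16 numbers by the fp32 numbers. The dictionary layout and the helper name fp16_ratio are mine, for illustration only.

```python
# Rows copied from the table above: (precision, batch size) -> measurements.
results = {
    ("fp32", 32): {"fwd": 191, "fwd_bwd": 55, "mem_mb": 4269},
    ("fp16", 32): {"fwd": 149, "fwd_bwd": 38, "mem_mb": 2373},
    ("fp32", 64): {"fwd": 194, "fwd_bwd": 55, "mem_mb": 7981},
    ("fp16", 64): {"fwd": 151, "fwd_bwd": 38, "mem_mb": 4239},
}

def fp16_ratio(batch_size, key):
    """fp16 measurement divided by the fp32 measurement at the same batch size."""
    return results[("fp16", batch_size)][key] / results[("fp32", batch_size)][key]

for batch_size in (32, 64):
    print(
        "batch", batch_size,
        "| memory: %.2f" % fp16_ratio(batch_size, "mem_mb"),
        "| forward: %.2f" % fp16_ratio(batch_size, "fwd"),
        "| forward + backward: %.2f" % fp16_ratio(batch_size, "fwd_bwd"),
    )
```

On these rows fp16 needs roughly 53 to 56 percent of the memory, but delivers only about 78 percent of the forward throughput and about 69 percent of the forward + backward throughput.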

 

CNN Benchmarks

Summary: unless you’re Nervana, if you use cuDNN everything is basically the same.

  • For CNN layers, we have the soumith convnet benchmarks, which are well documented and kept up to date. Unfortunately, the more recent benchmarks don’t include Theano. I am not entirely sure what the fp16 benchmark mentioned there is measuring, exactly; my guess (since it is running on an older Titan X) is that it is simulated and hacked in instead of using native fp16 support.

LSTM Benchmarks

  • For RNN layers (including LSTMs), there are the glample rnn benchmarks, which at this writing date back to May of 2016 and use TensorFlow version 0.8 when 0.11 is currently available. We reran these tests on more recent software, with results below.

 

Model: A single LSTM layer:

  • nn.SeqLSTM for Torch (updated version as of Nov 18, 2016)
  • tf.nn.rnn_cell.LSTMCell for TensorFlow 0.11
  • scan from Theano 0.8.2 (note: this version is compatible with cuDNN 5, not cuDNN 5.1, so there may be some problem there, but it still ran pretty quickly)
GPU: 1 Pascal Titan X
Drivers: CUDA 8 with cuDNN 5.1
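For readers unfamiliar with the samples / sec metric used in these results, here is a rough sketch of how such a number can be measured. This is not the benchmark code we ran. The function name samples_per_sec and the step_fn callable are hypothetical, and the fake_step stand-in just sleeps. With a real GPU framework, step_fn would also need to block until the GPU has finished its work, otherwise the timer only measures how quickly kernels are queued.

```python
import time

def samples_per_sec(step_fn, batch_size, n_iters=20, n_warmup=3):
    """Time repeated calls to step_fn and return throughput in samples / sec.

    step_fn is a hypothetical zero-argument callable that runs one forward
    (or forward + backward) pass over a single batch and returns only when
    the work is complete.
    """
    for _ in range(n_warmup):  # warm-up calls are excluded from the timing
        step_fn()
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return batch_size * n_iters / elapsed

def fake_step():
    """Stand-in for a real training step; sleeps for about a millisecond."""
    time.sleep(0.001)

rate = samples_per_sec(fake_step, batch_size=32, n_iters=5)
print("throughput: %.0f samples / sec" % rate)
```

The warm-up calls matter because the first few passes often include one-time costs such as memory allocation that would otherwise skew the average.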

 

Summary:

  • Generally, for just the forward pass, Torch > Theano > TensorFlow.
  • For forward + backward, it seems that Theano > Torch > TensorFlow. Torch and Theano are generally about the same in this case, except for smaller batch sizes with larger numbers of hidden units, where Theano crushes Torch and TensorFlow.
  • As the batch size and hidden layer size grow, the difference between these frameworks shrinks. This is not surprising, as more of the work is being handed off to CUDA, which is the same across the board.

 

framework   sequence length  batch size  hidden layer size  forward samples / sec  forward + backward samples / sec
Torch       30               32          128                22110                  4849
TensorFlow  30               32          128                2778                   1410
Theano      30               32          128                15462                  5440
Torch       30               32          512                6722                   1582
TensorFlow  30               32          512                2155                   1285
Theano      30               32          512                7127                   1874
Torch       30               32          1024               3618                   864
TensorFlow  30               32          1024               1790                   888
Theano      30               32          1024               4421                   1143
Torch       30               128         128                74897                  15131
TensorFlow  30               128         128                8656                   5411
Theano      30               128         128                53953                  14491
Torch       30               128         512                27781                  7335
TensorFlow  30               128         512                6421                   4238
Theano      30               128         512                23037                  6514
Torch       30               128         1024               10524                  3090
TensorFlow  30               128         1024               4753                   2702
Theano      30               128         1024               9679                   2751
Torch       60               32          128                11126                  2364
TensorFlow  60               32          128                1353                   879
Theano      60               32          128                5538                   3092
Torch       60               32          512                3344                   785
TensorFlow  60               32          512                1272                   811
Theano      60               32          512                3951                   1060
Torch       60               32          1024               1810                   428
TensorFlow  60               32          1024               1009                   467
Theano      60               32          1024               2339                   613
Torch       60               128         128                37693                  7575
TensorFlow  60               128         128                5278                   3328
Theano      60               128         128                31076                  8702
Torch       60               128         512                13966                  3676
TensorFlow  60               128         512                4057                   2691
Theano      60               128         512                12505                  3649
Torch       60               128         1024               5248                   1543
TensorFlow  60               128         1024               2695                   1423
Theano      60               128         1024               4366                   1409
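The claim that the gap between frameworks shrinks as the batch and hidden layer sizes grow can be checked directly against the numbers above. This sketch copies the forward + backward column for the smallest and largest configurations at each sequence length and computes the ratio of the fastest framework to the slowest. The helper name spread is mine, for illustration only.

```python
# Forward + backward samples / sec, copied from four rows of the table above.
# Key: (sequence length, batch size, hidden layer size).
fwd_bwd = {
    (30, 32, 128):   {"Torch": 4849, "TensorFlow": 1410, "Theano": 5440},
    (30, 128, 1024): {"Torch": 3090, "TensorFlow": 2702, "Theano": 2751},
    (60, 32, 128):   {"Torch": 2364, "TensorFlow": 879, "Theano": 3092},
    (60, 128, 1024): {"Torch": 1543, "TensorFlow": 1423, "Theano": 1409},
}

def spread(config):
    """Ratio of the fastest framework to the slowest for one configuration."""
    rates = fwd_bwd[config].values()
    return max(rates) / min(rates)

for config in sorted(fwd_bwd):
    print(config, "fastest / slowest = %.2f" % spread(config))
```

For batch size 32 with 128 hidden units, the fastest framework is between 3.5 and 3.9 times faster than the slowest. For batch size 128 with 1024 hidden units, that ratio drops to roughly 1.1.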

 

Fluffy Metrics

Usability

Because these are developer tools, we reviewed usability in terms of the Python interfaces for TensorFlow and Theano and the Lua interface for Torch.

Writing Code

Generally, it seems like the ease or challenge of using either Torch or TensorFlow comes from the choice of language. Everyone seems to have Python experience nowadays, whereas Lua experience is rarer. The lack of many basic functions in the Lua language further raises the barrier to entry for new users picking up and coding in the environment. Theano requires a paradigm shift in thinking about how to write the code, which makes it more verbose and complicated in general.

The neural network libraries built on top of Torch (nn, rnn, …) and TensorFlow/Theano (Keras), however, seem to be roughly equivalent in terms of structure and therefore are expected to be equivalent in terms of barrier to entry for new users to begin constructing their own models.

Reading Code

With the exception of the raw Theano library, both the raw frameworks and the neural network libraries built on top of them seem relatively straightforward to read and follow. There are small syntactic differences here and there, plus a few relatively confusing aspects that the user can just take on good faith are there for a good reason (“Why does collectgarbage() get called twice in a row here?”). Of course, if you only ever interact with Theano through the Keras library, then it doesn’t really matter how different raw Theano is.

Debugging

  • TensorFlow: Lots of tools. You can return whichever element of the graph you want and set up multiple watchers in TensorBoard, which also handles visualization and organization.
  • Torch: Debugging can be done using standard debug tools. Breakpoints can be set at locations in your own code, and in library code, and variables can be inspected at each trigger.
  • Theano: My experience with this is not recent, but Theano has historically been known to be a pain to debug.

Moving to Production

My understanding is that any of these could be run in a Docker container, which probably makes for the easiest deployment. Aside from that, one of the biggest difficulties with Torch is that its maintainers don’t actually cut releases of any of their code, so your dependencies are “whatever copy of Torch I have now.”


2 Responses

  1. Michael Holroyd

    “We ultimately found that TensorFlow’s combination of performance and usability made it the best choice as we move into Phase II.”

    Seems like a surprising choice based on this data showing TensorFlow is slower than Torch or Theano, and significantly so in some cases. Was it primarily the community support / tooling / “debugging” aspects that pushed you toward TensorFlow?

    • Tim Emerick

      In the end, the decision came down to three things:
      – The tooling and support, as you mention.
      – The existence and frequency of regular release versions.
      – The note above about how the difference between frameworks for forward + backward is far less significant with larger batch and hidden layer sizes is really key. Our models tend to live in that space, so the significant speed differences at smaller batch and/or hidden layer sizes are less of a concern for training in our case.

      The above comments apply to model training. For model use after training, there is still a significant speed concern even for larger layer and batch sizes. If this isn’t improved and becomes a significant bottleneck, we may need to revise our approach slightly. For example, we could build and train our models in TensorFlow so that we have access to all the tooling and the other TensorBoard metrics (which are more useful during model training than usage), then serialize the model and reload it as a Theano model (switching backends between training and usage is an operation that Keras largely supports). This would significantly improve our forward pass operations, but would also increase deployment challenges.
