Question (53 votes)

In Keras, the high-level deep learning library, there are multiple types of recurrent layers, including LSTM (Long Short-Term Memory) and CuDNNLSTM. According to the Keras documentation, a CuDNNLSTM is a:

Fast LSTM implementation backed by CuDNN. Can only be run on GPU, with the TensorFlow backend.

It is my belief that Keras automatically uses the GPU wherever possible. According to the TensorFlow build instructions, to have a working TensorFlow GPU backend, you will need CuDNN:

The following NVIDIA software must be installed on your system:

  • NVIDIA's Cuda Toolkit (>= 7.0). We recommend version 9.0. For details, see NVIDIA's documentation. Ensure that you append the relevant Cuda pathnames to the LD_LIBRARY_PATH environment variable as described in the NVIDIA documentation.
  • The NVIDIA drivers associated with NVIDIA's Cuda Toolkit.
  • cuDNN (>= v3). We recommend version 6.0. For details, see NVIDIA's documentation, particularly the description of appending the appropriate pathname to your LD_LIBRARY_PATH environment variable.

Therefore, how does a CuDNNLSTM differ from a normal LSTM when using a TensorFlow GPU backend? Will CuDNNLSTM be selected automatically in place of the normal LSTM when a GPU-enabled TensorFlow backend is available?
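For concreteness, here is a minimal sketch of the two layer classes I am comparing (Keras 2.x with the TensorFlow backend; the unit count and return_sequences flag are arbitrary placeholders):

    # Minimal sketch (Keras 2.x, TensorFlow backend); the unit count and
    # return_sequences flag are arbitrary placeholders, not from a real model.
    from keras.layers import LSTM, CuDNNLSTM

    plain = LSTM(128, return_sequences=True)        # portable: runs on CPU or GPU
    fused = CuDNNLSTM(128, return_sequences=True)   # cuDNN-backed: GPU + TF backend only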

krismath
  • I guess they are the same? It probably only differs when it is run without a GPU. – TYZ Apr 23 '18 at 20:03
  • Choice of LSTM <-> CuDNNLSTM is important if you are going to deploy the model into production. For example, Google Cloud Platform allows you to use only CPU machines in its "AI Platform" so far. So, if you train the model with CuDNNLSTM, you won't be able to deploy it. – Vlad-HC Oct 21 '19 at 11:55

5 Answers

Answer 1 (32 votes)

Why don't you try it out for yourself and see? In my case, training a model with LSTM took 10 minutes 30 seconds. Simply switching the call from LSTM() to CuDNNLSTM() brought that down to less than a minute.

I also noticed that switching to CuDNNLSTM() speeds up model.evaluate() and model.predict() substantially as well.
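If you want to reproduce the comparison yourself, a rough timing sketch along these lines works (this is not my exact benchmark: the data shapes and layer sizes are made up, and the CuDNNLSTM case needs a GPU-enabled TensorFlow backend):

    # Rough timing sketch (Keras 2.x): random data with hypothetical shapes,
    # comparing one training epoch of LSTM vs CuDNNLSTM.
    import time
    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, CuDNNLSTM, Dense

    x = np.random.rand(5000, 100, 32).astype('float32')   # (samples, timesteps, features)
    y = np.random.rand(5000, 1).astype('float32')

    for name, rnn in (('LSTM', LSTM), ('CuDNNLSTM', CuDNNLSTM)):
        model = Sequential([rnn(128, input_shape=(100, 32)), Dense(1)])
        model.compile(optimizer='adam', loss='mse')
        start = time.time()
        model.fit(x, y, epochs=1, batch_size=64, verbose=0)
        print('%s: %.1f s per epoch' % (name, time.time() - start))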

cryanbhu
  • Why am I just finding out about this? It's amazing! It used to take me 3 hours to evaluate a model on a large dataset; now it only takes about 20 minutes. – aL_eX Sep 07 '18 at 15:08
  • CuDNNLSTM is faster (it uses GPU support), but it has fewer options than LSTM (dropout, for example). – lbcommer Dec 27 '18 at 16:20
  • See this thread for more info: https://www.reddit.com/r/learnmachinelearning/comments/9jv0gx/what_is_the_difference_between_cudnnlstm_lstm/ – Arayan Singh May 04 '19 at 12:37
  • Just reinforcing what's been said: running a model with 10 LSTM layers, 3 dense layers and 200 neurons per layer with a hard-sigmoid activation function forced use of the generic GPU implementation (>40 minutes per epoch). Changing that back to a plain sigmoid permitted the cuDNN kernel (54 seconds per epoch). Makes a huge difference. – Jeremy Slater Jan 21 '21 at 23:54
  • I just tried it on Python 3.9.9 and TensorFlow 2.7.0. I replaced tf.keras.layers.LSTM(128, return_sequences=True) with tf.compat.v1.keras.layers.CuDNNLSTM(128, return_sequences=True) and got more than a 3x speed improvement, from 110ms per step to 38ms, running on an RTX 3080 Ti; lower-end cards may see an even greater improvement. So what the official documentation says about TensorFlow 2.x already using the cuDNN kernel is NOT true, at least it is not working for me. – Billy Cao Dec 10 '21 at 08:56
Answer 2 (16 votes)

In TensorFlow 2.0, the built-in LSTM and GRU layers have been updated to leverage CuDNN kernels by default when a GPU is available. With this change, the prior keras.layers.CuDNNLSTM/CuDNNGRU layers have been deprecated, and you can build your model without worrying about the hardware it will run on.

Since the CuDNN kernel is built with certain assumptions, the layer will not be able to use it if you change the defaults of the built-in LSTM or GRU layers.
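As a sketch of what that means in practice (TensorFlow 2.x with a GPU present; the unit count is an arbitrary placeholder), the first layer below can dispatch to the fused cuDNN kernel, while the second is forced onto the generic, slower implementation:

    # Sketch of the behaviour described above (tf.keras, TensorFlow 2.x).
    import tensorflow as tf

    cudnn_eligible = tf.keras.layers.LSTM(128)  # default tanh/sigmoid activations, recurrent_dropout=0

    generic_only = tf.keras.layers.LSTM(
        128,
        recurrent_dropout=0.2,                 # non-zero recurrent dropout disables the cuDNN path
        recurrent_activation='hard_sigmoid',   # a non-default recurrent activation does too
    )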

Check the TensorFlow RNN documentation: https://www.tensorflow.org/guide/keras/rnn

Sillians
  • This is NOT true. I just tried it on Python 3.9.9 and TensorFlow 2.7.0. I replaced tf.keras.layers.LSTM(128, return_sequences=True) with tf.compat.v1.keras.layers.CuDNNLSTM(128, return_sequences=True) and got more than a 3x speed improvement, from 110ms per step to 38ms, running on an RTX 3080 Ti; lower-end cards may see an even greater improvement. So what you said about TensorFlow 2.x already using the cuDNN kernel is not true, at least it is not working for me. – Billy Cao Dec 10 '21 at 08:56
  • Confirmed with Keras 2.4/TF 2.3 using an RTX 2060 Super: no speed-up was observed using CuDNNLSTM (via tf.compat.v1.keras.layers) vs. LSTM. In my case it was 24ms/step either way. – James_SO Dec 24 '21 at 19:50
Answer 3 (13 votes)

TL;DR: the difference is a 15x speed-up in model training time!

Performance benchmark: comparison across the test machines, 1 iteration of training on 612235 samples.

keras.layers.LSTM
  • Intel i5-4690 (CPU only): 3755s (6ms/step) - loss: 2.7339 - acc: 0.5067 - val_loss: 2.1149 - val_acc: 0.6175
  • GTX 950 + Intel i5-4690: 1417s (2ms/step) - loss: 2.7007 - acc: 0.5137 - val_loss: 2.0983 - val_acc: 0.6199 (a 2.5x gain with the GPU)
  • GTX 970 + Intel i5-4690: 1322s (2ms/step) - loss: 1.9214 - acc: 0.6442 - val_loss: 1.8808 - val_acc: 0.6461 (negligible gain from the more powerful GPU)
  • RTX 2070 + Intel i7-9700K: 1012s (2ms/step) - loss: 2.7268 - acc: 0.5111 - val_loss: 2.1162 - val_acc: 0.6234 (very little gain even with a serious hardware upgrade)

keras.layers.CuDNNLSTM
  • RTX 2070 + Intel i7-9700K: 69s (112us/step) - loss: 1.9139 - acc: 0.6437 - val_loss: 1.8668 - val_acc: 0.6469

54x gain over the CPU!
15x gain over the traditional (non-cuDNN) LSTM implementation!

Ricardo Gonzalez
  • It seems like only the last two cases need to be shown, since they are the fair test: same components, only CuDNNLSTM vs. LSTM differs. The first few examples just make your post confusing. – cryanbhu Dec 01 '19 at 12:18
Answer 4 (3 votes)

GPUs are good for massively parallel computation. Most linear algebra operations can be parallelized to improve performance: operations such as matrix multiplication and the matrix work inside gradient descent can be applied to large matrices and executed in parallel with GPU support. CUDA (Compute Unified Device Architecture) provides an interface that lets these vector and matrix operations take advantage of GPU parallelism, and cuDNN implements kernels for large matrix operations on the GPU using CUDA.

CuDNNLSTM is designed for this CUDA parallel processing and cannot run at all without a GPU, whereas LSTM is designed for normal CPUs. Its faster execution time comes from this parallelism.
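As a purely illustrative sketch (TensorFlow 2.x; the matrix size is arbitrary), this is the kind of operation that gets dispatched to parallel CUDA kernels when a GPU is available:

    # Illustrative sketch only: a large matrix multiplication placed on the GPU,
    # the kind of operation CUDA/cuDNN kernels parallelize.
    import tensorflow as tf

    a = tf.random.normal((4096, 4096))
    b = tf.random.normal((4096, 4096))

    if tf.config.list_physical_devices('GPU'):
        with tf.device('/GPU:0'):
            c = tf.matmul(a, b)   # runs as a parallel CUDA kernel on the GPU
    else:
        c = tf.matmul(a, b)       # falls back to the CPU implementation
    print(c.device)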

Answer 5 (0 votes)

lbcommer's comment hits the nail on the head. Switching from an LSTM layer to a CuDNNLSTM layer is much faster, roughly 10-20x, but you lose some options, which makes it less versatile. Important options you lose include masking, custom activations and dropout.

However, some of these properties can arguably be reintroduced with additional layers elsewhere in the model.
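For example, here is one possible (hypothetical) way to get dropout back, by inserting ordinary Dropout layers between stacked CuDNNLSTM layers (Keras 2.x; unit counts and input shape are placeholders). Note that this is plain inter-layer dropout, not the recurrent dropout the standard LSTM layer offers:

    # Sketch of a workaround: ordinary Dropout between stacked CuDNNLSTM layers.
    from keras.models import Sequential
    from keras.layers import CuDNNLSTM, Dropout, Dense

    model = Sequential([
        CuDNNLSTM(128, return_sequences=True, input_shape=(100, 32)),
        Dropout(0.2),
        CuDNNLSTM(128),
        Dropout(0.2),
        Dense(1),
    ])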

So if any of that matters, or you don't have a GPU, or deployment is a concern, stick to LSTM. Otherwise, CuDNNLSTM makes sense.

Also consider GRU for smaller datasets, as it is faster and more memory-efficient; it only starts to suffer accuracy issues as the dataset grows.

Also, look at attention/transformer-style models, which in Keras build on

tf.keras.layers.Attention()

These are also faster because all of the inputs are ingested at once.
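A minimal sketch of that layer in use (TensorFlow 2.x; the shapes are arbitrary). Note this is dot-product (Luong-style) attention, a building block rather than a full transformer:

    # Minimal sketch of tf.keras.layers.Attention; shapes are arbitrary placeholders.
    import tensorflow as tf

    query = tf.random.normal((8, 10, 64))   # (batch, query timesteps, features)
    value = tf.random.normal((8, 20, 64))   # (batch, value timesteps, features)

    attended = tf.keras.layers.Attention()([query, value])
    print(attended.shape)                    # (8, 10, 64)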

DataMonkey