
I am working on a reinforcement learning task and decided to use a Keras NN model for Q-value approximation. The approach is common: after each action the reward is stored in a memory replay array, then I take a random sample from it and fit the model with the new data state-action => reward + predicted_Q (more details here). To do the training, the Q value has to be predicted for each item in the training set.
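For illustration, a minimal sketch of that replay/fit loop, written with tf.keras (the 8/5/1 architecture matches the question below, but names such as `memory` and `gamma` are assumptions, not my actual code):

import random
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(5, activation="tanh", input_shape=(8,)),
    keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

memory = []   # (state_action, reward, next_state_actions) tuples
gamma = 0.9   # discount factor (assumed)

def replay(batch_size=50):
    batch = random.sample(memory, min(batch_size, len(memory)))
    x, y = [], []
    for state_action, reward, next_candidates in batch:
        # target = reward + gamma * max_a' Q(s', a') -- one predict per item
        q_next = model.predict(np.asarray(next_candidates), verbose=0)
        x.append(state_action)
        y.append(reward + gamma * float(q_next.max()))
    if x:
        model.fit(np.asarray(x), np.asarray(y), verbose=0)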

The script is running very slowly, so I started investigating. Profiling shows that 56.87% of cumulative time is taken by the _predict_loop method: (profiler screenshot) That looks strange, because prediction is just one-way propagation: a single multiplication of a set of numbers. The model I am using is very simple: 8 inputs, 5 nodes in the hidden layer, 1 output.

I have installed and configured CUDA, ran a few example tests which show that the GPU is used, and I can see a heavy GPU load there. When I run my code there is a message: "Using gpu device 0: GeForce GT 730", but the GPU load is really low (about 10%).

Is it normal for the predict function to take so much time? Is there a way to use the GPU for this computation?

asked by Serhiy, edited by talonmies

3 Answers


It seems the size of your NN is much too small to fully utilize the GPU. Typically the GPU is faster than a multi-core CPU only when the input/hidden/output layer sizes are larger than about 200~500 (depending on the implementation).

However, the size of your NN is only 8/5/1, which means most of the time is spent on GPU overhead such as CUDA kernel launches, PCIe data transfers, etc. In this case, the number of calls is the main factor that determines the training time. To speed up, you probably need to train your model on the CPU, with a programming language such as C/C++ that has much lower overhead.
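For scale: a forward pass through an 8/5/1 network is just two tiny matrix multiplications, so it can also be evaluated directly in NumPy once the weights are extracted with model.get_weights(). A sketch (not part of this answer; it assumes the trained Keras `model` is in scope and that the hidden layer is tanh with a linear output):

import numpy as np

# Extract the trained weights from the Keras model once
w1, b1, w2, b2 = model.get_weights()   # shapes: (8, 5), (5,), (5, 1), (1,)

def fast_q(x):
    # Manual forward pass; adjust the activations to match the real model
    h = np.tanh(x @ w1 + b1)
    return h @ w2 + b2

q = fast_q(np.random.rand(50, 8))      # one call scores a whole batch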

answered by kangshiyin
  • Thanks Eric, this is very helpful. Now it's clear why the GPU doesn't help. But on the CPU it is almost as slow. So there is still a question about the predict method's performance. Is there a chance I am using Keras in the wrong way? Or is Keras not an option in my scenario? – Serhiy Jun 25 '16 at 15:29
  • @Serhiy Maybe there's something wrong. Are you sure that you need to train your small NN (8/5/1) with as many as 521080 iterations/samples/mini-batches? – kangshiyin Jun 25 '16 at 15:39
  • I am not. I have a small state space: 5 variables. I have 8 actions to take at each step. First I tried to train a model with 8 outputs as suggested here: outlace.com/Reinforcement-Learning-Part-3/, but that case had 4 actions. In my case, each time I train the model, 7 out of 8 actions will be random values. The learning is not efficient; all outputs are close to the average value. So I decided to have a separate NN model for each action. After each step I add (action, state, reward) to the experience array. Then I take 50 items from there and for each item predict the best Q to fit the model. – Serhiy Jun 25 '16 at 16:12
  • 521080 is not the size of the training array. It's the number of times I did the prediction. After each step I take 50 items from the memory array (to avoid catastrophic forgetting), and for each item I predict the Q value for each possible action. So predict is called 50*8 times after each step (see the batching sketch after this thread). The model is trained with a 50-item dataset after each step. – Serhiy Jun 25 '16 at 16:14
  • @Serhiy You could try to run the original example to see if they need to repeat a single operation so many times for such a small NN. – kangshiyin Jun 25 '16 at 16:23
  • They calculate the prediction 50 times and I do it 50*8. They have 4 actions and use 4 output neurons to predict the Q value for each action taken in the current state. I can't do the same because learning is not efficient that way (I tried it). So I predict the Q value for each state+action combination, that is 50*8. – Serhiy Jun 25 '16 at 16:47
  • @Serhiy OK. When you say it is slow and not efficient, are you over-expecting the speed? Have you seen an example that runs even faster? – kangshiyin Jun 25 '16 at 16:51
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/115598/discussion-between-serhiy-and-eric). – Serhiy Jun 25 '16 at 16:55
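Editorial note on the thread above: most of the cost of each predict() call is fixed per-call overhead, so the 50*8 single-row calls per step can be collapsed into a single batched call. A sketch, assuming the 8-input model from the question (placeholder random inputs stand in for the real state-action encodings):

import numpy as np
from tensorflow import keras

# Stand-in for the 8/5/1 model from the question
model = keras.Sequential([
    keras.layers.Dense(5, activation="tanh", input_shape=(8,)),
    keras.layers.Dense(1),
])

# One (50*8, 8) batch of all state-action combinations instead of
# 400 single-row predict() calls
candidates = np.random.rand(50 * 8, 8)
q = model.predict(candidates, verbose=0).reshape(50, 8)
best_q = q.max(axis=1)   # best Q per memory item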

It's a well-known issue. That's why we have CNMeM: a library, developed by NVIDIA, which helps deep learning frameworks manage CUDA memory. CNMeM is already integrated into Theano, so you don't have to install anything. To enable CNMeM in Theano, add these lines to .theanorc:

[lib]
cnmem = 0.8

The cnmem value specifies the amount of GPU memory allocated for Theano. To quote the documentation:

0: not enabled.

0 < N <= 1: use this fraction of the total GPU memory (clipped to .95 for driver memory). [Note: This should be a float value, for instance 0.25 or 1.0]

>1: use this number in megabytes (MB) of memory.
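For context, a complete .theanorc that enables the GPU together with CNMeM might look like this (the [global] settings are the standard ones for the old Theano GPU backend, not part of the quoted documentation):

[global]
device = gpu
floatX = float32

[lib]
cnmem = 0.8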

For more information and instructions about .theanorc and cuDNN (another useful library), please visit:

http://ankivil.com/making-theano-faster-with-cudnn-and-cnmem-on-windows-10/

answered by Byte_me

Your model is really small, so you could also run the inference on the CPU and try OpenVINO for better performance. OpenVINO is optimized for Intel hardware, but it should work with any processor. It optimizes your model by converting it to Intermediate Representation (IR), performing graph pruning, and fusing some operations into others while preserving accuracy.

There are performance benchmarks for various models and CPUs.

It's rather straightforward to convert your Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.

Install OpenVINO

The easiest way to do it is using pip. Alternatively, you can use this tool to find the best way in your case.

pip install openvino-dev[tensorflow2]

Save your model as SavedModel

OpenVINO is not able to convert an HDF5 model, so you have to save it as a SavedModel first.

import tensorflow as tf
from custom_layer import CustomLayer  # only needed if your model uses custom layers

# Load the Keras HDF5 model and re-save it in the SavedModel format
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')

Use Model Optimizer to convert the SavedModel

The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try the FP16 precision, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line (adjust --input_shape to match your model's input):

mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"

Run the inference

The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (graphics integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice for you is, just use AUTO.

from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")

# Get output layer
output_layer_ir = compiled_model_ir.output(0)

# Run inference on the input image (input_image: a preprocessed NumPy array)
result = compiled_model_ir([input_image])[output_layer_ir]

Disclaimer: I work on OpenVINO.

answered by dragon7