
Curiously, I just found out that my CPU is much faster for predictions: doing inference on the GPU is much slower than on the CPU.

I have a tf.keras (TF2) NN model with a single dense layer:

import numpy as np
import tensorflow as tf

input = tf.keras.layers.Input(shape=(100,), dtype='float32')
X = tf.keras.layers.Dense(2)(input)
model = tf.keras.Model(input, X)

# the dense layer is also initialized with weights from a file
weights = np.load("weights.npy", allow_pickle=True)
model.layers[-1].set_weights(weights)

scores = model.predict_on_batch(data)

For predictions on 100 samples I get:

2 s for GPU
0.07 s for CPU (!)

I am using a simple GeForce MX150 with 2 GB.

I also tried predict_on_batch(x), since someone suggested it is faster than predict, but here it takes the same time.
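
For reference, here is roughly how the comparison can be timed explicitly (the tf.device scopes, the warm-up call and the random test data below are illustrative assumptions, not my exact script, and the '/GPU:0' run of course requires a visible GPU):

import time
import numpy as np
import tensorflow as tf

# Random stand-in data with the same shape as above.
data = np.random.rand(100, 100).astype('float32')

def build_model():
    inp = tf.keras.layers.Input(shape=(100,), dtype='float32')
    out = tf.keras.layers.Dense(2)(inp)
    return tf.keras.Model(inp, out)

# Build and run the same model once per device.
for device in ['/CPU:0', '/GPU:0']:
    with tf.device(device):
        model = build_model()
        model.predict_on_batch(data)              # warm-up, not timed
        start = time.time()
        model.predict_on_batch(data)
        print(device, time.time() - start, 's')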

Refer: Why does keras model predict slower after compile?

Does anyone have an idea what is going on here? What could the issue possibly be?

ctiid
  • "As is this a simple 1 layer NN with tf.keras I think I do not need an example. I think it is a GPU related questions". You absolutely do need an example. Often times the problem in situations like these is something you did in calling the code. – mCoding Dec 18 '20 at 18:04
  • see my changes, – ctiid Dec 18 '20 at 18:10
  • There are hundreds of questions asking why this code runs slow in the GPU but fast in the CPU, and the answer is always the same, you are not putting enough load in the GPU (model is very small) to overcome communication between CPU and GPU, so the whole process is slower than just using the CPU. – Dr. Snoopy Dec 18 '20 at 21:36
  • Thanks. But what do you mean by 'you are not putting enough load in the GPU (model is very small) to overcome communication between CPU and GPU'? – ctiid Dec 19 '20 at 14:24

2 Answers


Using the GPU adds a lot of overhead: data has to be loaded into GPU memory (through the relatively slow PCI bus) and the results have to be copied back. For the GPU to be more efficient than the CPU, the model must be very big, there must be plenty of data, and the algorithms must be able to run fully inside the GPU without moving partial results back to the CPU.
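
One way to see this in practice (an addition of mine, not something the original point depends on) is to ask TensorFlow to log where each op is placed:

import numpy as np
import tensorflow as tf

# Must be set before any ops are created; every op placement is then logged.
tf.debugging.set_log_device_placement(True)

x = tf.constant(np.random.rand(100, 100).astype('float32'))
y = tf.keras.layers.Dense(2)(x)   # the log shows whether this ran on /GPU:0 or /CPU:0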

The optimal configuration depends on the amount of memory and the number of cores in your GPU, so you have to run some tests, but the following rules apply:

  1. Your NN must have at least 10k parameters and the training data set must have at least 10k records. Otherwise the overhead will probably kill the GPU's performance.
  2. When you call model.fit, use a large batch_size (pay attention, the default is only 32), possibly large enough to contain your whole dataset, or at least a multiple of 1024. Do some tests to find the optimum for you (a sketch combining points 2-4 follows this list).
  3. For some GPUs, it might help to perform computations in float16 instead of float32. Follow this tutorial to see how to activate it.
  4. If your GPU has Tensor Cores, then to use its hardware efficiently several dimensions must be multiples of 8. In the tutorial above, see the paragraph "Ensuring GPU Tensor Cores are used" for which parameters must be changed and how. In general, it is a bad idea to use layers whose number of neurons is not a multiple of 8.
  5. Some types of layers, namely RNNs, have an architecture which cannot be executed directly on the GPU. In this case, data must constantly be moved back and forth between CPU and GPU and the speed is lost. If an RNN is really needed, TensorFlow v2 has an implementation of the LSTM layer which is optimized for GPU, but it comes with some limitations on the parameters: see this thread and the documentation.
  6. If you are training a reinforcement learning agent, activate experience replay and use a memory buffer for the experiences that is at least 10x your batch_size. This way, you trigger NN training only when a big chunk of data is ready.
  7. Deactivate as much verbosity as possible.
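
A minimal sketch combining points 2-4 (the layer sizes, batch size and random data are arbitrary illustrations, not values tuned for the asker's model; on older TF 2.x releases the mixed-precision call lives under tf.keras.mixed_precision.experimental instead):

import numpy as np
import tensorflow as tf

# Point 3: compute in float16 where possible (keep the output layer in float32).
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Point 4: layer widths that are multiples of 8.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(8, dtype='float32'),
])
model.compile(optimizer='adam', loss='mse')

# Point 1: enough data; point 2: large batch_size; point 7: low verbosity.
x_train = np.random.rand(20000, 1024).astype('float32')
y_train = np.random.rand(20000, 8).astype('float32')
model.fit(x_train, y_train, batch_size=4096, epochs=1, verbose=0)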

If everything is set up correctly, you should be able to train your model faster with GPU than with CPU.

Luca

The GPU pays off only for compute-intensive tasks (large models), because of the overhead of copying your data and results between the host and the GPU. In your case the model is very small, so it takes longer to copy the data than to run the prediction. Even though the CPU is slower than the GPU at raw compute, it doesn't have to copy the data, so it's ultimately faster.
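
If you want to keep the GPU available for other work but pin this particular small model to the CPU, one option (a sketch of mine, not something the asker has to do) is to build it under an explicit CPU device scope:

import numpy as np
import tensorflow as tf

# Creating the model inside a CPU scope keeps its weights and ops on the host,
# so no host<->device copies are needed for these tiny predictions.
with tf.device('/CPU:0'):
    inp = tf.keras.layers.Input(shape=(100,), dtype='float32')
    out = tf.keras.layers.Dense(2)(inp)
    model = tf.keras.Model(inp, out)

    data = np.random.rand(100, 100).astype('float32')  # stand-in for your data
    scores = model.predict_on_batch(data)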

dragon7