
I tried to get an estimate of the prediction time of my Keras model and noticed something strange. Apart from being fairly fast most of the time, every once in a while the model needs quite a long time to come up with a prediction. And not only that, those times also increase the longer the model runs. I added a minimal working example to reproduce the error.

import time
import numpy as np
from sklearn.datasets import make_classification
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

# Make a dummy classification problem
X, y = make_classification()

# Make a dummy model
model = Sequential()
model.add(Dense(10, activation='relu', name='input', input_shape=(X.shape[1],)))
model.add(Dense(2, activation='softmax', name='predictions'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.fit(X, y, verbose=0, batch_size=20, epochs=100)

for i in range(1000):
    # Pick a random sample
    sample = np.expand_dims(X[np.random.randint(99), :], axis=0)
    # Record the prediction time 10x and then take the average
    start = time.time()
    for j in range(10):
        y_pred = model.predict_classes(sample)
    end = time.time()
    print('%d, %0.7f' % (i, (end-start)/10))

The time does not depend on the sample (it is picked randomly). If the test is repeated, the iterations of the for loop in which the prediction takes longer are (nearly) the same again.

[plot: average prediction time per loop iteration, showing recurring spikes that grow over the run]

I'm using:

tensorflow 2.0.0
python 3.7.4

For my application I need to guarantee execution within a certain time. That is impossible with this behaviour, however. What is going wrong? Is it a bug in Keras or a bug in the TensorFlow backend?

EDIT: predict_on_batch shows the same behaviour, though the spikes are sparser:

[plot: prediction times using predict_on_batch]

y_pred = model(sample, training=False).numpy() shows some heavy outliers as well, but they do not increase over time:

[plot: prediction times using a direct model call]
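For reference, a sketch of the timing loop above adapted to the direct call (not necessarily the exact code used for the plot; it assumes an additional import tensorflow as tf, and the explicit float32 conversion is optional):

import tensorflow as tf

for i in range(1000):
    # Pick a random sample and convert it to a float32 tensor
    sample = np.expand_dims(X[np.random.randint(99), :], axis=0)
    sample = tf.convert_to_tensor(sample, dtype=tf.float32)
    # Record the prediction time 10x and then take the average
    start = time.time()
    for j in range(10):
        y_pred = model(sample, training=False).numpy()
    end = time.time()
    print('%d, %0.7f' % (i, (end - start) / 10))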

EDIT 2: I downgraded to the latest TensorFlow 1 version (1.15). Not only has the problem disappeared, the "normal" prediction time has also improved significantly! I do not consider the two spikes problematic, as they did not reappear when I repeated the test (at least not at the same indices and increasing linearly) and are, percentage-wise, not as large as in the first plot.

[plot: prediction times under TensorFlow 1.15]

We can thus conclude that this seems to be a problem inherent to TensorFlow 2.0, which shows similar behaviour in other situations, as @OverLordGoldDragon mentions.

Sebastian Dengler

2 Answers


TF2 generally exhibits poor and bug-like memory management in several instances I've encountered - brief description here and here. With prediction in particular, the most performant feeding method is via model(x) directly - see here, and its linked discussions.

In a nutshell: model(x) acts via its __call__ method (which it inherits from base_layer.Layer), whereas predict(), predict_classes(), etc. involve a dedicated loop function via _select_training_loop(); each utilizes different data pre- and post-processing methods suited to different use cases, and model(x) in 2.1 was designed specifically to yield the fastest small-model / small-batch (and maybe any-size) performance (and it is still the fastest in 2.0).

Quoting a TensorFlow dev from linked discussions:

You can predict the output using model call, not model predict, i.e., calling model(x) would make this much faster because there are no "conversion to dataset" part, and also it's directly calling a cached tf.function.
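A quick way to compare the two paths on the question's model (a rough sketch, not a rigorous benchmark; the warm-up call and time.perf_counter are my choices here, and model and X are taken from the question's example):

import time
import numpy as np
import tensorflow as tf

sample = np.expand_dims(X[0], axis=0)                       # one record, shape (1, n_features)
sample_t = tf.convert_to_tensor(sample, dtype=tf.float32)   # pre-converted tensor for model(x)

def avg_time(fn, n=100):
    fn()                                                    # warm-up: builds / caches the tf.function
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n

t_predict = avg_time(lambda: model.predict(sample))
t_call = avg_time(lambda: np.argmax(model(sample_t, training=False).numpy(), axis=-1))
print('predict(): %.6f s/call, model(x): %.6f s/call' % (t_predict, t_call))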

Note: this should be less of an issue in 2.1, and especially 2.2 - but test each method anyway. Also, I realize this doesn't directly answer your question on the time spikes; I suspect it's related to Eager caching mechanisms, but the surest way to find out is via the TF Profiler, which is broken in 2.1.


Update: regarding the increasing spikes, possibly GPU throttling; you've done ~1000 iters, try 10,000 instead - eventually, the increase should stop. As you noted in your comments, this doesn't occur with model(x); that makes sense, as one fewer GPU step is involved ("conversion to dataset").

Update 2: you could bug the devs here about it if you face this issue; it's mostly me singing there.

OverLordGoldDragon
  • This is a good answer to why one method is slower, but it doesn't explain the increasing run time over multiple runs. – LLSv2.0 Feb 19 '20 at 19:03
  • @LLSv2.0 Not entirely sure myself, but updated answer - I'm still awaiting a response from devs when I raised this issue myself [here](https://github.com/tensorflow/tensorflow/issues/33487#issuecomment-585221997) – OverLordGoldDragon Feb 19 '20 at 19:12
  • Thanks for the answer! I'll try going back to TF1 and see what happens there. As for the GPU explanation: I don't think it's correct, as I'm training and predicting solely on CPU. – Sebastian Dengler Feb 19 '20 at 19:34
  • @ga97dil Yeah, then I'm out of explanations - try asking on Github, though you may face lengthy response times. – OverLordGoldDragon Feb 19 '20 at 19:53
  • I downgraded to tensorflow 1.15. No increasing spikes and nearly 10 times faster prediction speed. It's incredible. The true reason for this is probably beyond me. For now I'll keep using TF1 and stay away from 2. – Sebastian Dengler Feb 19 '20 at 20:52
  • If Tensorflow/Keras didn't give better results than Pytorch, I would have switched to Torch; it's waaaaaaaay faster. Only pure TF1 has comparable speed. – Daniel Möller Feb 19 '20 at 21:06
  • @ga97dil Indeed, TF1 can be [much faster](https://stackoverflow.com/questions/58441514/why-is-tensorflow-2-much-slower-than-tensorflow-1) than TF2 - though TF 2.1 is worth trying for small models & datasets, as it's the fastest in training in my benchmarks (didn't do prediction). More importantly, if you ever use TF2, I strongly suggest you test _reproducibility_ in Graph vs. Eager; results can differ _extremely_ in TF 2.1. – OverLordGoldDragon Feb 19 '20 at 21:15
  • I've added your post to the [Git thread](https://github.com/tensorflow/tensorflow/issues/33487#issuecomment-588480988), and my TF2 vs. TF1 post. Thanks for informing me that the problem disappears in TF 1. – OverLordGoldDragon Feb 19 '20 at 21:32

While I can't explain the inconsistencies in execution time, I can recommend that you try to convert your model to TensorFlow Lite to speed up predictions on single data records or small batches.

I ran a benchmark on this model:

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(384, activation='elu', input_shape=(256,)),
    tf.keras.layers.Dense(384, activation='elu'),
    tf.keras.layers.Dense(256, activation='elu'),
    tf.keras.layers.Dense(128, activation='elu'),
    tf.keras.layers.Dense(32, activation='tanh')
])

The prediction times for single records were:

  1. model.predict(input): 18ms
  2. model(input): 1.3ms
  3. Model converted to TensorFlow Lite: 43us

The time to convert the model was 2 seconds.

The class below shows how to convert and use the model and provides a predict method like the Keras model. Note that it would need to be modified for use with models that don’t just have a single 1-D input and a single 1-D output.

import numpy as np
import tensorflow as tf


class LiteModel:

    @classmethod
    def from_file(cls, model_path):
        return LiteModel(tf.lite.Interpreter(model_path=model_path))

    @classmethod
    def from_keras_model(cls, kmodel):
        converter = tf.lite.TFLiteConverter.from_keras_model(kmodel)
        tflite_model = converter.convert()
        return LiteModel(tf.lite.Interpreter(model_content=tflite_model))

    def __init__(self, interpreter):
        self.interpreter = interpreter
        self.interpreter.allocate_tensors()
        input_det = self.interpreter.get_input_details()[0]
        output_det = self.interpreter.get_output_details()[0]
        self.input_index = input_det["index"]
        self.output_index = output_det["index"]
        self.input_shape = input_det["shape"]
        self.output_shape = output_det["shape"]
        self.input_dtype = input_det["dtype"]
        self.output_dtype = output_det["dtype"]

    def predict(self, inp):
        inp = inp.astype(self.input_dtype)
        count = inp.shape[0]
        out = np.zeros((count, self.output_shape[1]), dtype=self.output_dtype)
        for i in range(count):
            self.interpreter.set_tensor(self.input_index, inp[i:i+1])
            self.interpreter.invoke()
            out[i] = self.interpreter.get_tensor(self.output_index)[0]
        return out

    def predict_single(self, inp):
        """ Like predict(), but only for a single record. The input data can be a Python list. """
        inp = np.array([inp], dtype=self.input_dtype)
        self.interpreter.set_tensor(self.input_index, inp)
        self.interpreter.invoke()
        out = self.interpreter.get_tensor(self.output_index)
        return out[0]
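For example (a sketch; x here stands for any float32 NumPy array of shape (n, 256) matching the model above):

# Convert the Keras model once, then reuse the wrapper for fast predictions
lite_model = LiteModel.from_keras_model(model)
y_batch = lite_model.predict(x)              # shape (n, 32), one interpreter invoke per record
y_single = lite_model.predict_single(x[0])   # shape (32,)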

The complete benchmark code and a plot can be found here: https://medium.com/@micwurm/using-tensorflow-lite-to-speed-up-predictions-a3954886eb98

Michael