
Why does Keras model predict slower after compile?

In theory, prediction time should be constant, since the weights have a fixed size. How do I get my prediction speed back after compile (without needing to remove the optimizer)?

See associated experiment: https://nbviewer.jupyter.org/github/off99555/TensorFlowExperiments/blob/master/test-prediction-speed-after-compile.ipynb?flush_cache=true

– off99555
  • I think you need to fit the model after compilation, then use the trained model to predict. Refer [here](https://keras.io/models/model/) – naive Oct 14 '19 at 14:08
  • @naive Fitting is irrelevant to the issue. If you know how the network actually works, you would be curious why the prediction is slower. When predicting, only the weights are used for matrix multiplication, and the weights are the same before and after compile, so the prediction time should stay constant. – off99555 Oct 14 '19 at 15:17
  • I know that it is irrelevant to the *issue*. And one does not need to know how the network works to point out that the task you have come up with, and are comparing accuracy for, is actually meaningless. You are predicting without fitting the model on any data, and are really just comparing the time taken. This is not the usual or right use case for a neural network – naive Oct 14 '19 at 15:20
  • And, about the higher time that you recorded: how reliable is your estimate? You understand what an *estimate* is, I guess. Anyways, [point estimate](https://en.wikipedia.org/wiki/Point_estimation) – naive Oct 14 '19 at 15:23
  • @naive The problem concerns understanding model performance compiled vs uncompiled, and has nothing to do with accuracy or model design. It's a legitimate issue that can cost TF users - I for one didn't have a clue about it till stumbling upon this question. – OverLordGoldDragon Oct 14 '19 at 15:33
  • @OverLordGoldDragon Okay. But how useful is `model.compile()` for making predictions? My point is that the `.predict()` method is most often used after `model.fit()`, unlike the example in the question. – naive Oct 14 '19 at 15:42
  • @naive You cannot `fit` without `compile`; the optimizer doesn't even exist to update any weights. `predict` _can_ be used without `fit` or `compile` as described in my answer, but the performance difference shouldn't be this dramatic - hence the problem. – OverLordGoldDragon Oct 14 '19 at 15:52

2 Answers


UPDATE - 1/15/2020: the current best practice for small batch sizes is to feed inputs to the model directly - i.e. `preds = model(x)`, or `model(x, training=False)` if layers behave differently at train / inference time. Per the latest commit, this is now documented.

I haven't benchmarked these, but per the Git discussion, it's also worth trying `predict_on_batch()` - especially with the improvements in TF 2.1.
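
For concreteness, a minimal sketch of the three inference options above (model and layer sizes are illustrative, not from the benchmarks below):

import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

ipt   = Input(shape=(4,))
out   = Dense(1, activation='sigmoid')(ipt)
model = Model(ipt, out)
x = np.random.randn(32, 4).astype('float32')

preds = model(x)                   # direct call; best practice for small batches
preds = model(x, training=False)   # explicit inference mode (Dropout, BatchNorm, etc.)
preds = model.predict_on_batch(x)  # single-batch alternative worth benchmarking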


ULTIMATE CULPRIT: `self._experimental_run_tf_function = True`. It's experimental. But it's not actually bad.

To any TensorFlow devs reading: clean up your code. It's a mess. And it violates important coding practices, such as one function does one thing; `_process_inputs` does a lot more than "process inputs", and the same goes for `_standardize_user_data`. "I'm not paid enough" - but you do pay, in extra time spent understanding your own stuff, and in users filling your Issues page with bugs that would be easier to resolve with clearer code.


SUMMARY: the per-call overhead added by compile() is roughly constant, so it's only a little slower with compile() - unless the model itself is small, in which case the overhead dominates.

compile() sets an internal flag which assigns a different prediction function to predict. This function constructs a new graph upon each call, slowing it down relative to the uncompiled version. However, the difference is pronounced only when model execution time is much shorter than data processing time. If we increase the model size to at least mid-sized, the two become equal. See code at the bottom.

This slight increase in data processing time is more than compensated for by amplified graph capability. Since it's more efficient to keep only one model graph around, the pre-compile graph is discarded. Nonetheless: if your model is small relative to the data, you are better off without compile() for model inference. See my other answer for a workaround.


WHAT SHOULD I DO?

Compare model performance compiled vs uncompiled as I have in code at the bottom.

  • Compiled is faster: run predict on a compiled model.
  • Compiled is slower: run predict on an uncompiled model.

Yes, both are possible, and which wins depends on (1) data size; (2) model size; (3) hardware. The code at the bottom actually shows the compiled model being faster, but 10 iterations is a small sample. See "workarounds" in my other answer for the "how-to".


DETAILS:

This took a while to debug, but was fun. Below I describe the key culprits I discovered, cite some relevant documentation, and show profiler results that led to the ultimate bottleneck.

(FLAG == `self._experimental_run_tf_function`, for brevity)

  1. `Model` by default instantiates with FLAG=False. `compile()` sets it to True.
  2. `predict()` involves acquiring the prediction function, `func = self._select_training_loop(x)`
  3. Without any special kwargs passed to `predict` and `compile`, all other flags are such that:
    • (A) FLAG==True --> `func = training_v2.Loop()`
    • (B) FLAG==False --> `func = training_arrays.ArrayLikeTrainingLoop()`
  4. From the source code docstring, (A) is heavily graph-reliant, uses more distribution strategy, and its ops are prone to creating & destroying graph elements, which "may" (do) impact performance.

True culprit: `_process_inputs()`, accounting for 81% of runtime. Its major component? `_create_graph_function()`, at 72% of runtime. This method does not even exist for (B). For a mid-sized model, however, `_process_inputs` comprises less than 1% of runtime. Code at the bottom, and profiling results follow.
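
To see the flag flip from point 1 above, a small sketch (the attribute is internal to TF 2.0's Keras, so it may be renamed or absent in other versions):

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

ipt   = Input(shape=(4,))
model = Model(ipt, Dense(1)(ipt))

print(model._experimental_run_tf_function)  # False --> (B), ArrayLikeTrainingLoop
model.compile('adam', loss='mse')
print(model._experimental_run_tf_function)  # True  --> (A), training_v2.Loop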


DATA PROCESSORS:

(A): `<class 'tensorflow.python.keras.engine.data_adapter.TensorLikeDataAdapter'>`, used in `_process_inputs()`. Relevant source code

(B): `numpy.ndarray`, returned by `convert_eager_tensors_to_numpy`. Relevant source code, and here


MODEL EXECUTION FUNCTION (e.g. `predict`)

(A): distribution function, and here

(B): distribution function (different), and here


PROFILER: results for code in my other answer, "tiny model", and in this answer, "medium model":

Tiny model: 1000 iterations, compile() [profiler screenshot]

Tiny model: 1000 iterations, no compile() [profiler screenshot]

Medium model: 10 iterations [profiler screenshot]


DOCUMENTATION (indirectly) on effects of compile(): source

"Unlike other TensorFlow operations, we don't convert python numerical inputs to tensors. Moreover, a new graph is generated for each distinct python numerical value, for example calling `g(2)` and `g(3)` will generate two new graphs."

"`function` instantiates a separate graph for every unique set of input shapes and datatypes. For example, the following code snippet will result in three distinct graphs being traced, as each input has a different shape."

"A single `tf.function` object might need to map to multiple computation graphs under the hood. This should be visible only as performance (tracing graphs has a nonzero computational and memory cost) but should not affect the correctness of the program."
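
The retracing behavior the docs describe is easy to reproduce; in this sketch, the print executes only while a new graph is being traced, not on graph reuse:

import tensorflow as tf

@tf.function
def g(x):
    print("Tracing for", x)  # runs once per new graph, not once per call
    return x * x

g(tf.constant([1.0]))       # traces graph #1
g(tf.constant([2.0]))       # same shape & dtype: reuses graph #1, no print
g(tf.constant([1.0, 2.0]))  # new shape: traces graph #2
g(2)                        # python int: a new graph per distinct value
g(3)                        # ... and another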


COUNTEREXAMPLE:

from tensorflow.keras.layers import Input, Dense, LSTM, Conv1D
from tensorflow.keras.layers import Flatten, Dropout
from tensorflow.keras.models import Model
import numpy as np
from time import time

def timeit(func, arg, iterations):
    # time `iterations` repeated calls of func(arg)
    t0 = time()
    for _ in range(iterations):
        func(arg)
    print("%.4f sec" % (time() - t0))

# mid-sized model: per-call compute dwarfs the compiled path's data processing
batch_size = 32
batch_shape = (batch_size, 400, 16)
ipt   = Input(batch_shape=batch_shape)
x     = LSTM(512, activation='relu', return_sequences=True)(ipt)
x     = Conv1D(128, 400, 1, padding='same')(x)
x     = Flatten()(x)
x     = Dense(256, activation='relu')(x)
x     = Dropout(0.5)(x)
x     = Dense(128, activation='relu')(x)
x     = Dense(64,  activation='relu')(x)
out   = Dense(1,  activation='sigmoid')(x)
model = Model(ipt, out)

X = np.random.randn(*batch_shape)
timeit(model.predict, X, 10)                       # uncompiled
model.compile('adam', loss='binary_crossentropy')
timeit(model.predict, X, 10)                       # compiled

Outputs:

34.8542 sec
34.7435 sec
– OverLordGoldDragon
  • What's the conclusion on what we should do to obtain the fastest prediction speed for any model size? Is it to just not do `compile()`? – off99555 Oct 14 '19 at 23:36
  • @off99555 "for any model size" - there's no such thing. Read the entire answer - if I took hours to debug it, a few minutes from the asker shouldn't be unreasonable. – OverLordGoldDragon Oct 14 '19 at 23:45
  • I read the entire thing, but it's hard to understand because I am not the one that debugged the code. So you need to give a conclusion that does not involve the intermediate variables that you found during the debugging phase - e.g. "If your model is small, then do not use compile. If your model is medium-sized, you can use compile." Something like that. – off99555 Oct 14 '19 at 23:49
  • @off99555 Fair enough; updated. The new section is fairly common-sense, but I can see it not being realized immediately. – OverLordGoldDragon Oct 14 '19 at 23:51
  • So the conclusion is that it's entirely possible for compile to be faster at one time and slower at another. So I will check both and stick with the one that is faster. Now I'm not sure how TensorFlow Lite behaves - maybe it removes the compilation part before sending the model to the mobile phone, so I wouldn't have to worry about compiling or not compiling. But I'm not sure. – off99555 Oct 14 '19 at 23:56
  • @off99555 Not that I tested, but very large models (ResNet, etc) may run notably faster compiled, esp. if _distributed_ on many devices - as **(A)** is more graph- and distribution-heavy. The surest test is, well, a test - like in the answer. Unfamiliar with TF Lite, but that's a separate question – OverLordGoldDragon Oct 14 '19 at 23:58
  • If you still want to use `fit()` and thus need to compile the model, in my case compiling with `model.compile(..., experimental_run_tf_function=False)` also fixed the performance problems. – Carsten Jan 18 '20 at 23:53

UPDATE: see actual answer posted as a separate answer; this post contains supplemental info


`.compile()` sets up the majority of the TF/Keras graph, including losses, metrics, gradients, and partly the optimizer and its weights - which guarantees a notable slowdown.

What is unexpected is the extent of the slowdown - 10-fold in my own experiment, and for `predict()`, which doesn't update any weights. Looking into TF2's source code, graph elements appear tightly intertwined, with resources not necessarily allocated "fairly".

It's possible the developers overlooked `predict`'s performance for an uncompiled model, as models are typically used compiled - but in practice, the difference is unacceptable. It's also possible it's a "necessary evil", as there is a simple workaround (see below).

This isn't a complete answer, and I hope someone can provide it here - if not, I'd suggest opening a Github issue on TensorFlow. (OP has; here)


Workaround 1: train the model, save its weights, re-build the model without compiling, then load the weights. Do not save the entire model (e.g. `model.save()`), as it'll load compiled - instead use `model.save_weights()` and `model.load_weights()`.

Workaround 2: as above, but save the full model and use `load_model(path, compile=False)`; suggestion credit: D. Möller
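
A sketch of both workarounds; `make_model()`, the file names, and the training data are placeholders for your own:

from tensorflow.keras.models import load_model

# Workaround 1: persist weights only, then rebuild uncompiled
model = make_model()                    # hypothetical model-building function
model.compile('adam', loss='binary_crossentropy')
model.fit(x_train, y_train)
model.save_weights('weights.h5')

fast_model = make_model()               # fresh, uncompiled copy of the architecture
fast_model.load_weights('weights.h5')
preds = fast_model.predict(x)           # runs on the faster uncompiled path

# Workaround 2: save the full model, but load it uncompiled
model.save('model.h5')
fast_model = load_model('model.h5', compile=False)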


UPDATE: to clarify, the optimizer is not fully instantiated by `compile` - its weights and update tensors are created only on the first call to a fitting function (`fit`, `train_on_batch`, etc.), via `model._make_train_function()`.

The observed behavior is thus even stranger. Worse yet, building the optimizer does not elicit any further slowdown (see below) - suggesting "graph size" is not the main explanation here.


EDIT: on some models, a 30x slowdown. TensorFlow, what have you done. Example below:

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
import numpy as np
from time import time

def timeit(func, arg, iterations):
    t0 = time()
    for _ in range(iterations):
        func(arg)
    print("%.4f sec" % (time() - t0))

ipt   = Input(shape=(4,))
x     = Dense(2, activation='relu')(ipt)
out   = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)

X = np.random.randn(32, 4)

timeit(model.predict, X, 1000)                    # uncompiled
model.compile('adam', loss='binary_crossentropy')
timeit(model.predict, X, 1000)                    # compiled
model._make_train_function()                      # build optimizer
timeit(model.predict, X, 1000)                    # compiled + optimizer built

Outputs:

0.9891 sec
29.785 sec
29.521 sec
– OverLordGoldDragon
  • That's interesting. It's been a while that I've wanted to test training with a static graph `model.fit()` versus a dynamic loop with eager execution on, to see whether the performance loss is too big... – Daniel Möller Oct 14 '19 at 15:01
  • In the past I noticed a significant speed difference between Keras and PyTorch (PyTorch being way faster). – Daniel Möller Oct 14 '19 at 15:01
  • I have opened an issue here: https://github.com/tensorflow/tensorflow/issues/33340 – off99555 Oct 14 '19 at 15:15
  • @DanielMöller Really - is it still the case now? For one, I observe a **4x** slowdown on `OptimizerV2` optimizers vs. `Optimizer` optimizers, for no good reason - and the source code is a tangled mess. I've invested fairly heavily in studying TF and its graphs, but if it continues in this manner, I may ditch it for the torch – OverLordGoldDragon Oct 14 '19 at 15:19
  • I don't know how it is with TensorFlow 2.0 and its Keras package. But I noticed this weirdly over-complicated `OptimizerV2` code and I thought "why?". I still prefer Keras because it's definitely easier to get parts of models and intermediate results, and to create all sorts of Frankenstein models. And since it now seems to work very well in eager mode with custom loops (pretty much what you normally do in PyTorch), I kinda like it for ease of coding. If it weren't for Keras' ability to easily make models from models, I'd go to PyTorch. – Daniel Möller Oct 14 '19 at 15:25
  • Back then, I was reading that tf and torch had similar speeds, and that Keras was what made it slow. But I can't say which part of Keras made it slow, and I didn't really test, because I was not used to pure tensorflow. – Daniel Möller Oct 14 '19 at 15:30
  • @DanielMöller Calling `predict` after `model._set_optimizer` fails with `'Model' object has no attribute 'loss'` - suggesting that `predict` is indeed linked with train methods and is hence affected by `compile`. In fact, that step redirects `predict`'s execution entirely down `fit`'s road. This is quite interesting; think I'll investigate – OverLordGoldDragon Oct 14 '19 at 17:13
  • Yes. It is a bad design choice to put training-related code inside prediction, because users will call this predict function sequentially many times in production. It should run as fast as possible, to cause the least surprise. Compared to a numpy implementation, you only need to multiply a matrix, add a bias, and activate - that's it for a dense layer. There is no need to involve any loss function. – off99555 Oct 14 '19 at 17:18
  • Hint: you can use `load_model(name, compile=False)`; it's simpler than saving/loading weights and recreating the model. – Daniel Möller Oct 14 '19 at 20:17
  • @off99555 I think I figured it out, and will write the answer separately as it'll be long; you'll never guess the main source of the problem, as neither would I - and it does seem like a bug, a big bad one – OverLordGoldDragon Oct 14 '19 at 21:09