THIS ANSWER: aims to provide a detailed, graph/hardware-level description of the issue - including TF2 vs. TF1 train loops, input data processors, and Eager vs. Graph mode execution. For an issue summary & resolution guidelines, see my other answer.
PERFORMANCE VERDICT: sometimes one is faster, sometimes the other, depending on configuration. As far as TF2 vs TF1 goes, they're about on par on average, but significant config-based differences do exist, and TF1 trumps TF2 more often than vice versa. See "BENCHMARKING" below.
EAGER VS. GRAPH: the meat of this entire answer for some: TF2's eager is slower than TF1's, according to my testing. Details further down.
The fundamental difference between the two is: Graph sets up a computational network proactively, and executes when 'told to' - whereas Eager executes everything upon creation. But the story only begins here (see the sketch after this list):

- Eager is NOT devoid of Graph, and may in fact be mostly Graph, contrary to expectation. What it largely is, is executed Graph - this includes model & optimizer weights, which comprise a great portion of the graph.
- Eager rebuilds part of its own graph at execution; a direct consequence of the graph not being fully built - see profiler results. This has a computational overhead.
- Eager is slower w/ Numpy inputs; per this Git comment & code, Numpy inputs in Eager include the overhead cost of copying tensors from CPU to GPU. Stepping through the source code, the data handling differences are clear: Eager directly passes Numpy, while Graph passes tensors which then evaluate to Numpy; uncertain of the exact process, but the latter should involve GPU-level optimizations.
- TF2 Eager is slower than TF1 Eager - this is... unexpected. See benchmarking results below. Differences span from negligible to significant, but are consistent. Unsure why it's the case - if a TF dev clarifies, I'll update the answer.
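To make the distinction concrete, a minimal sketch (my own toy example, TF2 API - not part of the benchmarks below) of immediate vs. deferred execution:

import tensorflow as tf

# Eager: ops execute the moment they're created; values are available immediately
x = tf.constant([1., 2.])
print(x + 1)  # tf.Tensor([2. 3.], shape=(2,), dtype=float32)

# Graph: tf.function traces the Python body into a graph on the first call,
# then executes that graph when 'told to' (i.e. when called)
@tf.function
def g(x):
    print(tf.executing_eagerly())  # False: the body runs as a traced graph
    return x + 1

print(g(x))  # first call builds the graph, then executes it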
TF2 vs. TF1: quoting relevant portions of a TF dev's (Q. Scott Zhu's) response - w/ a bit of my emphasis & rewording:
In eager, the runtime needs to execute the ops and return the numerical value for every line of python code. The nature of single step execution causes it to be slow.

In TF2, Keras leverages tf.function to build its graph for training, eval, and prediction. We call them "execution function" for the model. In TF1, the "execution function" was a FuncGraph, which shared some common components as the TF function, but has a different implementation.

During the process, we somehow left an incorrect implementation for train_on_batch(), test_on_batch() and predict_on_batch(). They are still numerically correct, but the execution function for x_on_batch is a pure python function, rather than a tf.function wrapped python function. This will cause slowness.

In TF2, we convert all input data into a tf.data.Dataset, by which we can unify our execution function to handle the single type of the inputs. There might be some overhead in the dataset conversion, and I think this is a one-time-only overhead, rather than a per-batch cost.
With the last sentence of the last paragraph above, and the last clause of the paragraph below:

To overcome the slowness in eager mode, we have @tf.function, which will turn a python function into a graph. When fed a numerical value like an np array, the body of the tf.function is converted into a static graph, being optimized, and returns the final value, which is fast and should have similar performance as TF1 graph mode.

I disagree - per my profiling results, which show Eager's input data processing to be substantially slower than Graph's. Also, unsure about tf.data.Dataset in particular, but Eager does repeatedly call multiple of the same data conversion methods - see profiler.
Lastly, the dev's linked commit: "Significant number of changes to support the Keras v2 loops."
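For reference, a rough microbenchmark sketch of the dev's point (my own toy example in TF2, not part of the benchmarks below; timings are hardware-dependent, and the first traced call must be excluded):

import tensorflow as tf
from time import time

x = tf.random.normal((256, 256))

def step(x):  # toy per-step computation
    for _ in range(10):
        x = tf.matmul(x, x) * 1e-3
    return x

graph_step = tf.function(step)
graph_step(x)  # warm-up; the first call traces the graph

for name, fn in [("eager", step), ("graph", graph_step)]:
    t0 = time()
    for _ in range(100):
        fn(x).numpy()  # .numpy() forces the result, avoiding async skew
    print("%s: %.5f sec/iter" % (name, (time() - t0) / 100))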
Train Loops: depending on (1) Eager vs. Graph and (2) input data format, training will proceed with a distinct train loop - in TF2, via _select_training_loop() in training.py, one of:
training_v2.Loop()
training_distributed.DistributionMultiWorkerTrainingLoop(
              training_v2.Loop()) # multi-worker mode
# Case 1: distribution strategy
training_distributed.DistributionMultiWorkerTrainingLoop(
              training_distributed.DistributionSingleWorkerTrainingLoop())
# Case 2: generator-like. Input is Python generator, or Sequence object,
# or a non-distributed Dataset or iterator in eager execution.
training_generator.GeneratorOrSequenceTrainingLoop()
training_generator.EagerDatasetOrIteratorTrainingLoop()
# Case 3: Symbolic tensors or Numpy array-like. This includes Datasets and iterators
# in graph mode (since they generate symbolic tensors).
training_generator.GeneratorLikeTrainingLoop() # Eager
training_arrays.ArrayLikeTrainingLoop() # Graph
Each handles resource allocation differently and bears consequences on performance & capability.
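If curious which loop gets picked for given inputs, one can poke at the selector directly - a sketch for TF 2.0 only, since _select_training_loop() is private and may change or vanish in other versions:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile('adam', 'mse')

x = np.random.randn(32, 4)
ds = tf.data.Dataset.from_tensor_slices(
    (x, np.random.randn(32, 1))).batch(8)

# private API - for inspection only
print(type(model._select_training_loop(x)))   # loop chosen for Numpy inputs
print(type(model._select_training_loop(ds)))  # loop chosen for Dataset inputs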
Train Loops: fit vs train_on_batch, keras vs. tf.keras: each of the four uses different train loops, though perhaps not in every possible combination. keras' fit, for example, uses a form of fit_loop, e.g. training_arrays.fit_loop(), and its train_on_batch may use K.function(). tf.keras has a more sophisticated hierarchy, described in part in the previous section.
Train Loops: documentation -- relevant source docstring on some of the different execution methods:

Unlike other TensorFlow operations, we don't convert python numerical inputs to tensors. Moreover, a new graph is generated for each distinct python numerical value

tf.function instantiates a separate graph for every unique set of input shapes and datatypes.

A single tf.function object might need to map to multiple computation graphs under the hood. This should be visible only as performance (tracing graphs has a nonzero computational and memory cost)
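A quick sketch of what those docstrings imply (my own toy example) - Python scalars retrace per distinct value, while tensors retrace only per shape & dtype:

import tensorflow as tf

@tf.function
def f(x):
    print("tracing")  # Python side effects run only while a new graph is traced
    return x + 1

f(1)                # "tracing" - new graph for Python value 1
f(2)                # "tracing" - distinct Python value, another graph
f(tf.constant(1.))  # "tracing" - new input signature (float32 scalar tensor)
f(tf.constant(2.))  # no print  - reuses the float32 scalar graph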
Input data processors: similar to the above, the processor is selected case-by-case, depending on internal flags set according to runtime configurations (execution mode, data format, distribution strategy). The simplest case is Eager's, which works directly w/ Numpy arrays. For some specific examples, see this answer.
MODEL SIZE, DATA SIZE:
- Is decisive; no single configuration crowned itself atop all model & data sizes.
- Data size relative to model size is important; for small data & models, data transfer (e.g. CPU to GPU) overhead can dominate. Likewise, small-overhead processors can run slower on large data, per data conversion time dominating (see convert_to_tensor in "PROFILER", and the sketch after this list).
- Speed differs per train loops' and input data processors' differing means of handling resources.
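A rough way to see that conversion cost scale with data size (my own sketch; absolute numbers are hardware-dependent):

import numpy as np
import tensorflow as tf
from time import time

for n in (10**4, 10**7):
    x = np.random.randn(n).astype('float32')
    t0 = time()
    for _ in range(100):
        tf.convert_to_tensor(x)  # Numpy -> tensor (incl. host-to-device copy on GPU)
    print("%.0e elements: %.6f sec/call" % (n, (time() - t0) / 100))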
BENCHMARKS: the ground meat. -- Word Document -- Excel Spreadsheet
Terminology:
- %-less numbers are all seconds
- % computed as (1 - longer_time / shorter_time)*100; rationale: we're interested in by what factor one is faster than the other; shorter / longer is actually a non-linear relation, not useful for direct comparison (see the worked example after this list)
- % sign determination:
  - TF2 vs TF1: + if TF2 is faster
  - GvE (Graph vs. Eager): + if Graph is faster
- TF2 = TensorFlow 2.0.0 + Keras 2.3.1; TF1 = TensorFlow 1.14.0 + Keras 2.2.5
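Worked example: if one configuration runs at 0.40 sec/iter and the other at 0.50, then (1 - 0.50/0.40)*100 = -25, i.e. a magnitude of 25% (the slower run takes 25% more time than the faster); the sign is then set by the rules above - e.g. reported as +25% if the 0.40 run is TF2 in a TF2 vs TF1 comparison.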
PROFILER:
PROFILER - Explanation: Spyder 3.3.6 IDE profiler.
Some functions are repeated in nests of others; hence, it's hard to track down the exact separation between "data processing" and "training" functions, so there will be some overlap - as pronounced in the very last result.
- % figures computed w.r.t. runtime minus build time
- Build time computed by summing all (unique) runtimes which were called 1 or 2 times
- Train time computed by summing all (unique) runtimes which were called the same # of times as the # of iterations, and some of their nests' runtimes
- Functions are profiled according to their original names, unfortunately (i.e. _func = func will profile as func), which mixes in build time - hence the need to exclude it
TESTING ENVIRONMENT:
- Executed code at bottom w/ minimal background tasks running
- GPU was "warmed up" w/ a few iterations before timing iterations, as suggested in this post
- CUDA 10.0.130, cuDNN 7.6.0, TensorFlow 1.14.0, & TensorFlow 2.0.0 built from source, plus Anaconda
- Python 3.7.4, Spyder 3.3.6 IDE
- GTX 1070, Windows 10, 24GB DDR4 2.4-GHz RAM, i7-7700HQ 2.8-GHz CPU
METHODOLOGY:
- Benchmark 'small', 'medium', & 'large' model & data sizes
- Fix # of parameters for each model size, independent of input data size
- "Larger" model has more parameters and layers
- "Larger" data has a longer sequence, but same
batch_size
and num_channels
- Models only use
Conv1D
, Dense
'learnable' layers; RNNs avoided per TF-version implem. differences
- Always ran one train fit outside of benchmarking loop, to omit model & optimizer graph building
- Not using sparse data (e.g.
layers.Embedding()
) or sparse targets (e.g. SparseCategoricalCrossEntropy()
LIMITATIONS: a "complete" answer would explain every possible train loop & iterator, but that's surely beyond my available time, nonexistent paycheck, or general necessity. The results are only as good as the methodology - interpret with an open mind.
CODE:
import numpy as np
import tensorflow as tf
import random
from termcolor import cprint
from time import time
from tensorflow.keras.layers import Input, Dense, Conv1D
from tensorflow.keras.layers import Dropout, GlobalAveragePooling1D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import tensorflow.keras.backend as K
# To benchmark standalone `keras`, comment the tf.keras imports above
# and uncomment these:
#from keras.layers import Input, Dense, Conv1D
#from keras.layers import Dropout, GlobalAveragePooling1D
#from keras.models import Model
#from keras.optimizers import Adam
#import keras.backend as K

# Execution-mode toggles:
#tf.compat.v1.disable_eager_execution()  # TF2 -> Graph mode
#tf.enable_eager_execution()             # TF1 -> Eager mode
def reset_seeds(reset_graph_with_backend=None, verbose=1):
if reset_graph_with_backend is not None:
K = reset_graph_with_backend
K.clear_session()
tf.compat.v1.reset_default_graph()
if verbose:
print("KERAS AND TENSORFLOW GRAPHS RESET")
np.random.seed(1)
random.seed(2)
if tf.__version__[0] == '2':
tf.random.set_seed(3)
else:
tf.set_random_seed(3)
if verbose:
print("RANDOM SEEDS RESET")
print("TF version: {}".format(tf.__version__))
reset_seeds()
def timeit(func, iterations, *args, _verbose=0, **kwargs):
    # average wall-clock time per call of func(*args, **kwargs)
t0 = time()
for _ in range(iterations):
func(*args, **kwargs)
print(end='.'*int(_verbose))
print("Time/iter: %.4f sec" % ((time() - t0) / iterations))
def make_model_small(batch_shape):
ipt = Input(batch_shape=batch_shape)
x = Conv1D(128, 40, strides=4, padding='same')(ipt)
x = GlobalAveragePooling1D()(x)
x = Dropout(0.5)(x)
x = Dense(64, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
model.compile(Adam(lr=1e-4), 'binary_crossentropy')
return model
def make_model_medium(batch_shape):
ipt = Input(batch_shape=batch_shape)
x = ipt
for filters in [64, 128, 256, 256, 128, 64]:
x = Conv1D(filters, 20, strides=1, padding='valid')(x)
x = GlobalAveragePooling1D()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
model.compile(Adam(lr=1e-4), 'binary_crossentropy')
return model
def make_model_large(batch_shape):
ipt = Input(batch_shape=batch_shape)
x = Conv1D(64, 400, strides=4, padding='valid')(ipt)
x = Conv1D(128, 200, strides=1, padding='valid')(x)
for _ in range(40):
x = Conv1D(256, 12, strides=1, padding='same')(x)
x = Conv1D(512, 20, strides=2, padding='valid')(x)
x = Conv1D(1028, 10, strides=2, padding='valid')(x)
x = Conv1D(256, 1, strides=1, padding='valid')(x)
x = GlobalAveragePooling1D()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
model.compile(Adam(lr=1e-4), 'binary_crossentropy')
return model
def make_data(batch_shape):
return np.random.randn(*batch_shape), \
np.random.randint(0, 2, (batch_shape[0], 1))
def make_data_tf(batch_shape, n_batches, iters):
    data = np.random.randn(n_batches, *batch_shape)
    trgt = np.random.randint(0, 2, (n_batches, batch_shape[0], 1))
    return tf.data.Dataset.from_tensor_slices((data, trgt))#.repeat(iters)
batch_shape_small = (32, 140, 30)
batch_shape_medium = (32, 1400, 30)
batch_shape_large = (32, 14000, 30)
batch_shapes = batch_shape_small, batch_shape_medium, batch_shape_large
make_model_fns = make_model_small, make_model_medium, make_model_large
iterations = [200, 100, 50]
shape_names = ["Small data", "Medium data", "Large data"]
model_names = ["Small model", "Medium model", "Large model"]
def test_all(fit=False, tf_dataset=False):
    # benchmark every (model size, data size) pairing; skips large model + small data
for model_fn, model_name, iters in zip(make_model_fns, model_names, iterations):
for batch_shape, shape_name in zip(batch_shapes, shape_names):
if (model_fn is make_model_large) and (batch_shape == batch_shape_small):
continue
reset_seeds(reset_graph_with_backend=K)
if tf_dataset:
data = make_data_tf(batch_shape, iters, iters)
else:
data = make_data(batch_shape)
model = model_fn(batch_shape)
if fit:
if tf_dataset:
model.train_on_batch(data.take(1))
t0 = time()
model.fit(data, steps_per_epoch=iters)
print("Time/iter: %.4f sec" % ((time() - t0) / iters))
else:
model.train_on_batch(*data)
timeit(model.fit, iters, *data, _verbose=1, verbose=0)
else:
model.train_on_batch(*data)
timeit(model.train_on_batch, iters, *data, _verbose=1)
cprint(">> {}, {} done <<\n".format(model_name, shape_name), 'blue')
del model
test_all(fit=True, tf_dataset=False)