I'm learning DRL with the book Deep Reinforcement Learning in Action. In chapter 3, they present the simple game Gridworld (instructions here, in the rules section) with the corresponding code in PyTorch.
I've experimented with the code, and it takes less than 3 minutes to train the network to an 89% win rate (it won 89 of 100 games after training).
As an exercise, I have migrated the code to TensorFlow. All the code is here.
The problem is that my TensorFlow port takes nearly 2 hours to train the network, and it only reaches a win rate of 84%. Both versions train on the CPU only (I don't have a GPU).
The training loss figures look correct, and so does the win rate (we have to take into account that the game is random and can produce impossible states). The problem is the performance of the overall process.
I'm doing something terribly wrong, but what?
The main differences are in the training loop. In PyTorch it is:
loss_fn = torch.nn.MSELoss()
learning_rate = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
....
Q1 = model(state1_batch)
with torch.no_grad():
    Q2 = model2(state2_batch)  #B target network, evaluated without tracking gradients
Y = reward_batch + gamma * ((1 - done_batch) * torch.max(Q2, dim=1)[0])  # Bellman target
X = Q1.gather(dim=1, index=action_batch.long().unsqueeze(dim=1)).squeeze()  # Q-values of the actions taken
loss = loss_fn(X, Y.detach())
optimizer.zero_grad()
loss.backward()
optimizer.step()
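For what it's worth, my understanding is that the gather/squeeze line just picks out, for each sample in the batch, the Q-value of the action that was actually taken. A toy example (made-up numbers, purely to illustrate the indexing) of what I think it does:

import torch

# Hypothetical batch of 3 samples with 4 possible actions (made-up values).
Q1 = torch.tensor([[0.1, 0.2, 0.3, 0.4],
                   [0.5, 0.6, 0.7, 0.8],
                   [0.9, 1.0, 1.1, 1.2]])
action_batch = torch.tensor([2, 0, 3])

# gather picks Q1[i, action_batch[i]] for every row; squeeze drops the extra dim.
X = Q1.gather(dim=1, index=action_batch.long().unsqueeze(dim=1)).squeeze()
print(X)  # tensor([0.3000, 0.5000, 1.2000])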
and in the TensorFlow version:
loss_fn = tf.keras.losses.MSE
learning_rate = 1e-3
optimizer = tf.keras.optimizers.Adam(learning_rate)
...
Q2 = model2(state2_batch)  #B target network, evaluated outside the tape
with tf.GradientTape() as tape:
    Q1 = model(state1_batch)
    Y = reward_batch + gamma * ((1 - done_batch) * tf.math.reduce_max(Q2, axis=1))  # Bellman target
    X = [Q1[i][action_batch[i]] for i in range(len(action_batch))]  # Q-values of the actions taken, gathered one by one in Python
    loss = loss_fn(X, Y)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
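One difference I'm aware of: the PyTorch version selects the taken-action Q-values with a single vectorized gather call, while my port builds X with a Python list comprehension that indexes Q1 element by element under eager execution. I don't know if that's the (whole) problem, but a vectorized sketch of what I believe is the TensorFlow equivalent (again with toy values, and I'm not sure it's idiomatic) would be:

import tensorflow as tf

# Same hypothetical batch: 3 samples, 4 actions (made-up values).
Q1 = tf.constant([[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.6, 0.7, 0.8],
                  [0.9, 1.0, 1.1, 1.2]])
action_batch = tf.constant([2, 0, 3])

# Selects Q1[i, action_batch[i]] per row, analogous to torch.gather:
X = tf.gather(Q1, action_batch, batch_dims=1)
print(X)  # tf.Tensor([0.3 0.5 1.2], shape=(3,), dtype=float32)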
Why is the training taking so long?