I'm learning DRL with the book Deep Reinforcement Learning in Action. In chapter 3, they present the simple game Gridworld (instructions here, in the rules section) with the corresponding code in PyTorch.
I've experimented with the code, and it takes less than 3 minutes to train the network to an 89% win rate (it won 89 of 100 games after training).
As an exercise, I have migrated the code to TensorFlow. All the code is here.
The problem is that my TensorFlow port takes nearly 2 hours to train the network, and it only reaches a win rate of 84%. Both versions train on the CPU only (I don't have a GPU).
The training loss figures look correct, and so does the win rate (we have to take into account that the game is random and can produce impossible states). The problem is the performance of the overall process.
I'm doing something terribly wrong, but what?
The main differences are in the training loop. In PyTorch it is:
loss_fn = torch.nn.MSELoss()
learning_rate = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
....
Q1 = model(state1_batch)
with torch.no_grad():
    Q2 = model2(state2_batch)  #B target network, evaluated without tracking gradients
Y = reward_batch + gamma * ((1 - done_batch) * torch.max(Q2, dim=1)[0])  # Bellman target
X = Q1.gather(dim=1, index=action_batch.long().unsqueeze(dim=1)).squeeze()  # Q-values of the actions taken
loss = loss_fn(X, Y.detach())
optimizer.zero_grad()
loss.backward()
optimizer.step()
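For what it's worth, my understanding is that the gather/squeeze line just picks out, for each sample in the batch, the Q-value of the action that was actually taken. A toy example (made-up numbers, purely to illustrate the indexing) of what I think it does:

import torch

# Hypothetical batch of 3 samples with 4 possible actions (made-up values).
Q1 = torch.tensor([[0.1, 0.2, 0.3, 0.4],
                   [0.5, 0.6, 0.7, 0.8],
                   [0.9, 1.0, 1.1, 1.2]])
action_batch = torch.tensor([2, 0, 3])

# gather picks Q1[i, action_batch[i]] for every row; squeeze drops the extra dim.
X = Q1.gather(dim=1, index=action_batch.long().unsqueeze(dim=1)).squeeze()
print(X)  # tensor([0.3000, 0.5000, 1.2000])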
and in the TensorFlow version:
loss_fn = tf.keras.losses.MSE
learning_rate = 1e-3
optimizer = tf.keras.optimizers.Adam(learning_rate)
...
Q2 = model2(state2_batch)  #B target network, evaluated outside the tape
with tf.GradientTape() as tape:
    Q1 = model(state1_batch)
    Y = reward_batch + gamma * ((1 - done_batch) * tf.math.reduce_max(Q2, axis=1))  # Bellman target
    X = [Q1[i][action_batch[i]] for i in range(len(action_batch))]  # Q-values of the actions taken, gathered one by one in Python
    loss = loss_fn(X, Y)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
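One difference I'm aware of: the PyTorch version selects the taken-action Q-values with a single vectorized gather call, while my port builds X with a Python list comprehension that indexes Q1 element by element under eager execution. I don't know if that's the (whole) problem, but a vectorized sketch of what I believe is the TensorFlow equivalent (again with toy values, and I'm not sure it's idiomatic) would be:

import tensorflow as tf

# Same hypothetical batch: 3 samples, 4 actions (made-up values).
Q1 = tf.constant([[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.6, 0.7, 0.8],
                  [0.9, 1.0, 1.1, 1.2]])
action_batch = tf.constant([2, 0, 3])

# Selects Q1[i, action_batch[i]] per row, analogous to torch.gather:
X = tf.gather(Q1, action_batch, batch_dims=1)
print(X)  # tf.Tensor([0.3 0.5 1.2], shape=(3,), dtype=float32)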
Why is the training taking so long?