
I'm running TensorFlow on a GPU for training. I have a 1-layer GRU cell, with a batch size of 800, and I train for 10 epochs. I see these spikes in the accuracy graph from TensorBoard and I do not understand why. See the image: [TensorBoard accuracy graph showing one spike per epoch]

If you count the spikes, there are 10 of them, the same as the number of epochs. I have tried this with different configurations, reducing the batch size and increasing the number of layers, but the spikes are still there. You can find the code here if it helps.

I use tf.RandomShuffleQueue for the data with infinite epochs, and I calculate how many steps the training should run. I do not think the problem is in how I calculate the accuracy (here). Do you have any suggestions as to why this happens?

EDIT: `min_after_dequeue=2000`
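
To illustrate, this is roughly the kind of queue-based pipeline I mean, sketched with the TF 1.x queue-runner API; the file name, feature parsing, and shapes below are placeholders, not my actual code:

import tensorflow as tf

# Placeholder input pipeline: a reader feeds single examples into a
# tf.RandomShuffleQueue, and training dequeues batches of 800.
filename_queue = tf.train.string_input_producer(["train.tfrecords"])  # loops forever by default
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized,
    features={"sequence": tf.FixedLenFeature([50], tf.float32),  # placeholder shape
              "label": tf.FixedLenFeature([], tf.int64)})

# Examples are only shuffled within a pool of `min_after_dequeue` elements,
# so a small value keeps batches close to the on-disk order.
example_queue = tf.RandomShuffleQueue(
    capacity=2000 + 3 * 800,
    min_after_dequeue=2000,  # value from the EDIT above
    dtypes=[tf.float32, tf.int64],
    shapes=[[50], []])
enqueue_op = example_queue.enqueue([features["sequence"], features["label"]])
tf.train.add_queue_runner(tf.train.QueueRunner(example_queue, [enqueue_op]))

batch_sequences, batch_labels = example_queue.dequeue_many(800)  # batch size 800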

asked by linnal
    Often this is a symptom of insufficient randomness in the training data. There are a couple of ways to improve this: (1) increase `min_after_dequeue` in the `tf.RandomShuffleQueue` so that examples are sampled from a larger population, (2) read multiple files in parallel (e.g. using `tf.train.shuffle_batch_join()` or `tf.contrib.data.parallel_interleave()`) instead of one file at a time. – mrry Feb 14 '18 at 21:51
  • Which is the actual value of `min_after_dequeue` that you are using? Please, add it in the original question as an edit. Thanks. – petrux Feb 14 '18 at 22:18
  • I'll try point (2) suggested by @mrry. I've edited the question with my `min_after_dequeue` value. Thanks – linnal Feb 15 '18 at 09:33
  • How are you computing accuracy? Is this eval accuracy or training accuracy? – Alexandre Passos Feb 17 '18 at 00:49
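
A minimal sketch of mrry's suggestion (2) above, using the TF 1.x tf.train.shuffle_batch_join() API; the shard file names, parsing, and number of readers are placeholders, not the question's actual code:

import tensorflow as tf

# Read several file shards with several readers in parallel, so the shuffle
# queue mixes examples from different files instead of draining one file at a time.
filename_queue = tf.train.string_input_producer(
    ["shard-0.tfrecords", "shard-1.tfrecords", "shard-2.tfrecords"])  # placeholder shards

def read_example(queue):
    reader = tf.TFRecordReader()
    _, serialized = reader.read(queue)
    features = tf.parse_single_example(
        serialized,
        features={"sequence": tf.FixedLenFeature([50], tf.float32),
                  "label": tf.FixedLenFeature([], tf.int64)})
    return [features["sequence"], features["label"]]

# One reader per list entry; shuffle_batch_join interleaves their outputs.
example_list = [read_example(filename_queue) for _ in range(4)]

sequences, labels = tf.train.shuffle_batch_join(
    example_list,
    batch_size=800,
    capacity=10000 + 3 * 800,
    min_after_dequeue=10000)  # larger pool -> better shuffling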

1 Answer


This seems like the same problem as in Tensorflow accuracy spikes in every epoch, but for a custom metric.

I wrote an answer for it already, but I can adapt the general idea here.

I couldn't track down the exact place where you update/reset your metrics, or where you register them, so I'm assuming this is handled automatically by TensorFlow. If so, I believe the spikes you're seeing are an artifact of averaging: the metric accumulates a running average over each epoch and is reset at the epoch boundary, so the curve jumps at the start of every epoch. You can probably see the per-batch metric by using

def on_batch_begin(batch, logs):
    # Clear the accumulated (running-average) metric state at the start of
    # every batch, so each logged value reflects only the current batch.
    model.reset_metrics()
    return

lambda_callback = tf.keras.callbacks.LambdaCallback(on_batch_begin=on_batch_begin)

and passing it when training with

model.fit(..., callbacks=[lambda_callback])

Note that this will make every epoch-level metric report only the value from the last training batch of that epoch.
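
To make the averaging artifact concrete, here is a toy sketch with made-up per-batch accuracies (assuming TF 2.x eager execution; these are not the question's numbers):

import tensorflow as tf

# Keras-style metrics accumulate a running mean over the epoch and are reset
# at epoch boundaries, so the logged curve jumps whenever the accumulator restarts.
acc = tf.keras.metrics.Mean()

for epoch in range(2):
    acc.reset_states()  # what Keras does at the start of each epoch
    for batch_acc in [0.50, 0.60, 0.70, 0.80]:  # made-up per-batch accuracies
        acc.update_state(batch_acc)
        print(epoch, float(acc.result()))  # running average, not the per-batch value

# The first logged value of each epoch (0.50) sits well below the last averaged
# value of the previous epoch (0.65), which shows up as a spike in the graph.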

answered by felippeduran