I'm using the Dataset API in TensorFlow. When I train a CNN, I've noticed that every time the dataset refills its shuffle buffer, the loss jumps back up to a very high value (about the same as at initialization). However, it then converges faster than at the very beginning: for example, dropping from 10 to 8 originally took about 500 steps (step 1 to step 500), but after the buffer refills, the loss jumps back to 10 and takes only 100 or fewer steps to drop to 8 again.
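For reference, here is a stripped-down sketch of the pipeline ordering I use (shuffle before repeat). As far as I understand, this ordering makes the shuffle buffer drain at the end of each pass over the data and refill when the next pass starts, which would match the "Filling up shuffle buffer" messages in the log below. The toy data and sizes here are made up purely for illustration:

import numpy as np
import tensorflow as tf

# toy stand-in for my real pipeline; data and sizes are illustrative only
dataset = tf.data.Dataset.from_tensor_slices(np.arange(100, dtype=np.int64))
# shuffle() placed before repeat(): the shuffle buffer is drained at the
# end of each epoch and refilled at the start of the next one
dataset = dataset.shuffle(buffer_size=100).repeat(3).batch(10)
next_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    while True:
        try:
            sess.run(next_batch)
        except tf.errors.OutOfRangeError:
            break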
Training log (note the very high loss at step 960):
2018-07-15 18:06:37.125745: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 9594 of 10240
2018-07-15 18:06:37.852214: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
INFO:tensorflow:step: 0 loss: 10.590 acc: 0.000 time: 22.603
INFO:tensorflow:step: 1 loss: 9.966 acc: 0.039 time: 2.598
INFO:tensorflow:step: 2 loss: 8.744 acc: 0.033 time: 2.612
INFO:tensorflow:step: 3 loss: 10.865 acc: 0.020 time: 2.583
INFO:tensorflow:step: 4 loss: 7.930 acc: 0.047 time: 2.602
INFO:tensorflow:step: 5 loss: 8.070 acc: 0.027 time: 2.553
INFO:tensorflow:step: 6 loss: 8.437 acc: 0.031 time: 2.588
INFO:tensorflow:step: 7 loss: 8.677 acc: 0.039 time: 2.582
INFO:tensorflow:step: 8 loss: 8.571 acc: 0.021 time: 2.594
INFO:tensorflow:step: 9 loss: 8.581 acc: 0.033 time: 2.582
INFO:tensorflow:step: 10 loss: 8.333 acc: 0.006 time: 2.581
************************** some steps omitted **************************
INFO:tensorflow:step: 954 loss: 8.333 acc: 0.008 time: 1.982
INFO:tensorflow:step: 955 loss: 8.385 acc: 0.006 time: 1.982
INFO:tensorflow:step: 956 loss: 8.297 acc: 0.006 time: 1.962
INFO:tensorflow:step: 957 loss: 8.173 acc: 0.004 time: 1.961
INFO:tensorflow:step: 958 loss: 8.155 acc: 0.004 time: 1.987
INFO:tensorflow:step: 959 loss: 8.189 acc: 0.018 time: 1.993
2018-07-15 18:46:42.820175: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 9907 of 10240
2018-07-15 18:46:43.191982: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
INFO:tensorflow:step: 960 loss: 12.504 acc: 0.000 time: 12.849
INFO:tensorflow:step: 961 loss: 11.177 acc: 0.000 time: 2.369
INFO:tensorflow:step: 962 loss: 10.289 acc: 0.000 time: 2.445
INFO:tensorflow:step: 963 loss: 9.984 acc: 0.000 time: 2.425
INFO:tensorflow:step: 964 loss: 9.824 acc: 0.000 time: 2.458
Code to build a Dataset:
import os
import tensorflow as tf

# read the data list from disk (sorted so the class order is deterministic)
face_classes = sorted(os.listdir(data_dir))
num_class = len(face_classes)
class2index = {k: v for v, k in enumerate(face_classes)}
data_list = map(lambda cls: list_images(data_dir, cls, class2index), face_classes)
# flatten the per-class lists into one list of (filename, label) pairs
data_list = [item for sublist in data_list for item in sublist]
# create a dataset of (filename, label) tensors
dataset = tf.data.Dataset.from_tensor_slices(
    (list(map(lambda item: item[0], data_list)),
     list(map(lambda item: item[1], data_list))))
dataset = dataset.prefetch(batch_size * 100)
dataset = dataset.map(decode_data)  # decode file names into images
dataset = dataset.shuffle(50 * batch_size).repeat(epoch_num).batch(batch_size)
return dataset.make_one_shot_iterator().get_next()
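For what it's worth, one variant I have been considering (but have not yet verified that it changes the behavior) is fusing the shuffle and repeat steps so the buffer is not drained between epochs. This sketch replaces the last two lines of the snippet above and assumes tf.contrib.data.shuffle_and_repeat from the TF 1.x contrib API:

# sketch only: fuse shuffle and repeat so elements from adjacent epochs
# can mix in the buffer instead of draining it at each epoch boundary
dataset = dataset.apply(
    tf.contrib.data.shuffle_and_repeat(buffer_size=50 * batch_size,
                                       count=epoch_num))
dataset = dataset.batch(batch_size)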
At first I thought this was the same bug as https://stackoverflow.com/a/43670684/5634636, but the problem persists even after I wrap os.listdir(data_dir) in sorted(). The phenomenon looks very much like that problem, but I'm not sure, and I don't know how to fix it. What exactly happens when the shuffle buffer is filled?