
I use the Dataset API in TensorFlow. When I train a CNN, I find that each time the dataset finishes filling the shuffle buffer, my loss jumps very high (to about the same value as at initialization). However, it then converges faster than at the very beginning (e.g. dropping the loss from 10 to 8 takes from step 1 to step 500, but after the buffer refills, the loss jumps back to 10 and takes only 100 or fewer steps to drop to 8 again).

Training log (the loss jumps very high at step 960):

2018-07-15 18:06:37.125745: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 9594 of 10240
2018-07-15 18:06:37.852214: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
INFO:tensorflow:step: 0 loss: 10.590 acc: 0.000 time: 22.603
INFO:tensorflow:step: 1 loss: 9.966 acc: 0.039 time: 2.598
INFO:tensorflow:step: 2 loss: 8.744 acc: 0.033 time: 2.612
INFO:tensorflow:step: 3 loss: 10.865 acc: 0.020 time: 2.583
INFO:tensorflow:step: 4 loss: 7.930 acc: 0.047 time: 2.602
INFO:tensorflow:step: 5 loss: 8.070 acc: 0.027 time: 2.553
INFO:tensorflow:step: 6 loss: 8.437 acc: 0.031 time: 2.588
INFO:tensorflow:step: 7 loss: 8.677 acc: 0.039 time: 2.582
INFO:tensorflow:step: 8 loss: 8.571 acc: 0.021 time: 2.594
INFO:tensorflow:step: 9 loss: 8.581 acc: 0.033 time: 2.582
INFO:tensorflow:step: 10 loss: 8.333 acc: 0.006 time: 2.581
**************************omit some steps**************************
INFO:tensorflow:step: 954 loss: 8.333 acc: 0.008 time: 1.982
INFO:tensorflow:step: 955 loss: 8.385 acc: 0.006 time: 1.982
INFO:tensorflow:step: 956 loss: 8.297 acc: 0.006 time: 1.962
INFO:tensorflow:step: 957 loss: 8.173 acc: 0.004 time: 1.961
INFO:tensorflow:step: 958 loss: 8.155 acc: 0.004 time: 1.987
INFO:tensorflow:step: 959 loss: 8.189 acc: 0.018 time: 1.993
2018-07-15 18:46:42.820175: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 9907 of 10240
2018-07-15 18:46:43.191982: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
INFO:tensorflow:step: 960 loss: 12.504 acc: 0.000 time: 12.849
INFO:tensorflow:step: 961 loss: 11.177 acc: 0.000 time: 2.369
INFO:tensorflow:step: 962 loss: 10.289 acc: 0.000 time: 2.445
INFO:tensorflow:step: 963 loss: 9.984 acc: 0.000 time: 2.425
INFO:tensorflow:step: 964 loss: 9.824 acc: 0.000 time: 2.458

Code to build a Dataset:

# read data list from disk
face_classes = sorted(os.listdir(data_dir))
num_class = len(face_classes)
class2index = {k: v for v, k in enumerate(face_classes)}
data_list = map(lambda cls: list_images(data_dir, cls, class2index), face_classes)
# flatten the list
data_list = [item for sublist in data_list for item in sublist]

# create a dataset
dataset = tf.data.Dataset.from_tensor_slices(
                (list(map(lambda item: item[0], data_list)),
                 list(map(lambda item: item[1], data_list))))

dataset = dataset.prefetch(batch_size * 100)
dataset = dataset.map(decode_data) # decode file name to image
dataset = dataset.shuffle(50 * batch_size).repeat(epoch_num).batch(batch_size)

return dataset.make_one_shot_iterator().get_next()
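
For reference, here is a minimal variant of the last part of the pipeline with `repeat` moved before `shuffle` (reusing the same `decode_data`, `batch_size` and `epoch_num` as above). This is only a sketch to illustrate the ordering, not a confirmed fix: when `shuffle` comes before `repeat`, the shuffle buffer is drained at the end of every epoch and refilled at the start of the next one, whereas `repeat` before `shuffle` keeps a single buffer running across epochs (at the cost of mixing examples from adjacent epochs inside the buffer).

# Same pipeline as above, but with repeat() before shuffle(), so the
# shuffle buffer is kept across epochs instead of being drained and
# refilled at every epoch boundary. Note: examples from adjacent epochs
# can then mix inside the buffer.
dataset = dataset.prefetch(batch_size * 100)
dataset = dataset.map(decode_data)  # decode file name to image
dataset = dataset.repeat(epoch_num).shuffle(50 * batch_size).batch(batch_size)

return dataset.make_one_shot_iterator().get_next()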

At first I thought this was the same bug as https://stackoverflow.com/a/43670684/5634636, but the problem still exists after I add sorted() to os.listdir(data_dir). This phenomenon looks very much like that problem, but I am not sure and don't know how to fix it. What happens when the shuffle buffer is filled?
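
For context, here is a toy, self-contained sketch (TF 1.x graph mode, as in the code above) of how the buffer behaves when shuffle() comes before repeat(): each repetition is shuffled on its own, so every group of two batches below is a permutation of 0-9 and elements never cross the epoch boundary; the buffer is rebuilt when the second repetition starts, which is consistent with the "Filling up shuffle buffer" message reappearing at step 960 in my log.

import tensorflow as tf

# Toy dataset: 10 elements, shuffle buffer of 4, 2 epochs. With shuffle()
# before repeat(), the shuffle buffer is rebuilt for each repetition, so
# the first two batches together contain 0-9 in some order, and the next
# two batches contain another permutation of 0-9 (no mixing across the
# epoch boundary).
dataset = tf.data.Dataset.range(10).shuffle(4).repeat(2).batch(5)
next_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    try:
        while True:
            print(sess.run(next_batch))
    except tf.errors.OutOfRangeError:
        pass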

huangbiubiu
  • Try increasing batch_size – FindOutIslamNow Sep 10 '18 at 11:28
  • @FindOutIslamNow Why would increasing `batch_size` fix that? What is the idea behind it? I tried it, but it doesn't seem to work. – huangbiubiu Sep 11 '18 at 10:55
  • In fact, I kept increasing batch_size and the error still occurred. In my case the error was only because I had features that needed some processing. I reduced the data, deleted some feature columns, and in the end the training succeeded for me. I also found some `nan` values in my dataset, which I detected using `np.isnan(dataset)` (maybe nan causes this bug as well!?) – FindOutIslamNow Sep 11 '18 at 11:25
  • For those who run into the same problem: I gave up on solving it. I rewrote the code, preprocessed the dataset with OpenCV, and saved the preprocessed data to disk. During training I read the preprocessed data and labels directly, and the bug no longer appears. I hope someone can find the solution. – huangbiubiu Sep 14 '18 at 03:15

0 Answers