7

I'm using TensorFlow 1.2 with a dataset in a 20 GB TFRecord file. There are about half a million samples in that TFRecord file.

It looks like if I choose a buffer_size smaller than the number of records in the dataset, only the first N records in the TFRecord will be used. https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset#shuffle

For example, if buffer_size = 100, it seems like only the first 100 records are ever used.
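For reference, this is a minimal sketch of the kind of input pipeline being described; the file name and batch size are made up, and in TensorFlow 1.2 the Dataset API lives under tf.contrib.data (tf.data in later releases):

```python
import tensorflow as tf

# Hypothetical TF 1.2-style input pipeline reading one large TFRecord file.
filenames = ["train.tfrecord"]                       # assumed file name
dataset = tf.contrib.data.TFRecordDataset(filenames)
dataset = dataset.shuffle(buffer_size=100)           # the buffer_size in question
dataset = dataset.batch(32)

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()                     # fed to the training op
```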

Question

Should buffer_size always be the length of the dataset? Would that impact training performance?

rodrigo-silveira

1 Answer

19

No matter what buffer size you choose, all samples will be used; the buffer size only affects the randomness of the shuffle.

If the buffer size is 100, it means that TensorFlow keeps a buffer of the next 100 samples and randomly selects one of those 100 samples. It then adds the next element from the dataset to the buffer.

So, if buffer_size = 1 there is no shuffle at all, and if buffer_size is greater than or equal to the dataset size, a perfect uniform random shuffle is guaranteed.
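A small sketch that makes this visible (written against tf.data as in later 1.x releases; under 1.2 the same methods live on tf.contrib.data.Dataset):

```python
import tensorflow as tf

# Ten elements, shuffled with a buffer of only 3: every element still appears
# exactly once, but each can only move a few positions from where it started.
dataset = tf.data.Dataset.range(10).shuffle(buffer_size=3)
next_element = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    order = []
    try:
        while True:
            order.append(sess.run(next_element))
    except tf.errors.OutOfRangeError:
        pass
    print(order)  # e.g. [1, 0, 3, 2, 5, 4, 7, 6, 9, 8] -- all 10 values, only weakly shuffled
```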

I would highly suggest shuffling the dataset before creating the TFRecords, and keeping a small buffer size.
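For instance, a sketch of that pre-shuffling step, assuming you already have a list of serialized tf.train.Example protos (how you build those depends on your data):

```python
import random
import tensorflow as tf

def write_shuffled_tfrecord(serialized_examples, path):
    """Shuffle serialized tf.train.Example protos once, globally, then write them."""
    random.shuffle(serialized_examples)                 # global shuffle up front
    with tf.python_io.TFRecordWriter(path) as writer:   # tf.io.TFRecordWriter in TF 2.x
        for example in serialized_examples:
            writer.write(example)
```

With the records already in random order on disk, a small shuffle buffer at read time is enough to break up any remaining local ordering without holding a large part of the dataset in memory.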

Matan Hugi
  • So does `buffer_size` imply that those records will be in memory? Why would you suggest keeping it small? – rodrigo-silveira Dec 12 '17 at 22:06
  • It depends on the size of each sample. From my experience, it takes a long time to start training when the buffer size is 10,000 and each sample is an image. – Matan Hugi Dec 12 '17 at 22:08
  • Please have a look at [46444018](https://stackoverflow.com/questions/46444018/meaning-of-buffer-size-in-dataset-map-dataset-prefetch-and-dataset-shuffle) to get a better idea of the underlying behavior – Max F. Feb 27 '18 at 14:12