I'm using TensorFlow 1.2 with a dataset stored in a 20 GB TFRecord file. There are about half a million samples in that TFRecord file.
It looks as though, if I choose a value for buffer_size that is smaller than the number of records in the dataset, only the first N records in the TFRecord are ever used (https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset#shuffle). For example, with buffer_size = 100, it seems that only the first 100 records are ever used.
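For reference, here is a minimal sketch of the input pipeline I'm describing; the file name and the parse function's feature spec are placeholders for my actual setup:

```python
import tensorflow as tf

def parse_example(serialized):
    # Hypothetical feature spec; the real one depends on how the
    # TFRecords were written.
    features = tf.parse_single_example(
        serialized,
        features={"image": tf.FixedLenFeature([], tf.string),
                  "label": tf.FixedLenFeature([], tf.int64)})
    return features["image"], features["label"]

# TF 1.2 contrib data API; "train.tfrecords" is a placeholder path.
dataset = tf.contrib.data.TFRecordDataset("train.tfrecords")
dataset = dataset.map(parse_example)
dataset = dataset.shuffle(buffer_size=100)  # appears to draw only from the first 100 records
dataset = dataset.batch(32)

iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
```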
Question
Should buffer_size always be the length of the dataset? Would that impact training performance?