
Looking at this code example from the TF documentation:

import tensorflow as tf  # TF 1.x API

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()

Does dataset.repeat(num_epochs) require the entire dataset to be loaded into memory? Or does it re-initialize the upstream dataset(s) when it hits the end-of-dataset exception?

The documentation is ambiguous about this point.
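
For reference, by "re-initializing" I mean something like the manual per-epoch loop below, which explicitly restarts the pipeline with an initializable iterator. This is only a sketch using the TF 1.x session API, with a toy range dataset standing in for the TFRecord pipeline and a placeholder num_epochs:

import tensorflow as tf  # TF 1.x session API

num_epochs = 2  # placeholder value for this sketch
dataset = tf.data.Dataset.range(5).shuffle(buffer_size=5)  # stand-in for the real pipeline
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    for _ in range(num_epochs):
        sess.run(iterator.initializer)  # explicitly restart the upstream pipeline
        while True:
            try:
                sess.run(next_element)
            except tf.errors.OutOfRangeError:  # end of one pass over the data
                break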


1 Answer


Based on this simple test, it appears that repeat does not buffer the dataset; it must be re-initializing the upstream datasets.

import tensorflow as tf  # TF 1.x session API

sess = tf.Session()
n = tf.data.Dataset.range(5).shuffle(buffer_size=5).repeat(2).make_one_shot_iterator().get_next()
[sess.run(n) for _ in range(10)]
Out[83]: [2, 0, 3, 1, 4, 3, 1, 0, 2, 4]

Logic suggests that if repeat were buffering its input, the same random shuffle pattern would have been repeated in this simple experiment.
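
A likely explanation (from the documented behaviour of shuffle, not verified against the implementation): shuffle takes a reshuffle_each_iteration argument that defaults to True, so each pass over the upstream data gets a new permutation. As a quick counter-check, the sketch below pins the shuffle order across repeats; if repeat simply re-runs the upstream pipeline, the same 5-element pattern should appear twice (this assumes a TF version where the argument is available, and reuses the sess from the snippet above):

fixed = (tf.data.Dataset.range(5)
         .shuffle(buffer_size=5, reshuffle_each_iteration=False)  # reuse one permutation per pass
         .repeat(2)
         .make_one_shot_iterator()
         .get_next())
[sess.run(fixed) for _ in range(10)]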
