
Looking at this code example from the TF documentation:

import tensorflow as tf  # TF 1.x API

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()

Does dataset.repeat(num_epochs) require the entire dataset to be loaded into memory? Or does it re-initialize the upstream dataset(s) when it hits the end-of-dataset exception?

The documentation is ambiguous about this point.
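
For reference, by "re-initializing" I mean something like the manual per-epoch loop below, which explicitly restarts the pipeline with an initializable iterator. This is only a sketch using the TF 1.x session API, with a toy range dataset standing in for the TFRecord pipeline and a placeholder num_epochs:

import tensorflow as tf  # TF 1.x session API

num_epochs = 2  # placeholder value for this sketch
dataset = tf.data.Dataset.range(5).shuffle(buffer_size=5)  # stand-in for the real pipeline
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    for _ in range(num_epochs):
        sess.run(iterator.initializer)  # explicitly restart the upstream pipeline
        while True:
            try:
                sess.run(next_element)
            except tf.errors.OutOfRangeError:  # end of one pass over the data
                break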


1 Answer


Based on this simple test, it appears that repeat does not buffer the dataset; it must be re-initializing the upstream datasets.

import tensorflow as tf  # TF 1.x session API

sess = tf.Session()
n = tf.data.Dataset.range(5).shuffle(buffer_size=5).repeat(2).make_one_shot_iterator().get_next()
[sess.run(n) for _ in range(10)]
Out[83]: [2, 0, 3, 1, 4, 3, 1, 0, 2, 4]

Logic suggests that if repeat were buffering its input, the same random shuffle pattern would have been repeated in this simple experiment.
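
A likely explanation (from the documented behaviour of shuffle, not verified against the implementation): shuffle takes a reshuffle_each_iteration argument that defaults to True, so each pass over the upstream data gets a new permutation. As a quick counter-check, the sketch below pins the shuffle order across repeats; if repeat simply re-runs the upstream pipeline, the same 5-element pattern should appear twice (this assumes a TF version where the argument is available, and reuses the sess from the snippet above):

fixed = (tf.data.Dataset.range(5)
         .shuffle(buffer_size=5, reshuffle_each_iteration=False)  # reuse one permutation per pass
         .repeat(2)
         .make_one_shot_iterator()
         .get_next())
[sess.run(fixed) for _ in range(10)]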
