Start reading them in order, shuffle right after:
BUFFER_SIZE = 1000 # arbitrary number
# define filenames somewhere, e.g. via glob
dataset = tf.data.TFRecordDataset(filenames).shuffle(BUFFER_SIZE)
EDIT:
The input pipeline of this question gave me an idea on how to implement filenames shuffling with the Dataset API:
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.shuffle(BUFFER_SIZE) # doesn't need to be big
dataset = dataset.flat_map(tf.data.TFRecordDataset)
dataset = dataset.map(decode_example, num_parallel_calls=5) # add your decoding logic here
# further processing of the dataset
This will put all the data of one file before the one of the next and so on. Files are shuffled, but the data inside them will be produced in the same order.
You can alternatively replace dataset.flat_map
with interleave
to process multiple files at the same time and return samples from each:
dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=4)
Note: interleave
does not actually run in multiple threads, it's a round-robin operation. For true parallel processing see parallel_interleave