
With the old input-pipeline API I can do:

filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

and then pass the filename queue to a reader, for example:

reader = tf.TFRecordReader()
_, serialized_example = reader.read_up_to(filename_queue, n)

How can I achieve similar behaviour with the Dataset API?

The tf.data.TFRecordDataset() expects a tensor of filenames in a fixed order.

Pekka
  • Have a look at [this presentation](https://docs.google.com/presentation/d/16kHNtQslt-yuJ3w8GIx-eEH6t_AvFeQOchqGRFpAD7U/edit#slide=id.g254d08e080_0_370) from the developer of `tf.Data` as well as [this answer](https://stackoverflow.com/a/48713164/6246880). – BiBi Dec 29 '18 at 19:20

2 Answers


Start reading them in order, shuffle right after:

BUFFER_SIZE = 1000 # arbitrary number
# define filenames somewhere, e.g. via glob
dataset = tf.data.TFRecordDataset(filenames).shuffle(BUFFER_SIZE)

EDIT:

The input pipeline of this question gave me an idea for how to implement filename shuffling with the Dataset API:

dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.shuffle(BUFFER_SIZE) # doesn't need to be big
dataset = dataset.flat_map(tf.data.TFRecordDataset)
dataset = dataset.map(decode_example, num_parallel_calls=5) # add your decoding logic here
# further processing of the dataset

This will put all the data of one file before the data of the next, and so on: the files are shuffled, but the records inside each file are produced in their original order. You can alternatively replace `dataset.flat_map` with `interleave` to read from multiple files at the same time and return samples from each:

dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=4)

Note: `interleave` does not actually run in multiple threads; it's a round-robin operation. For true parallel processing, see `parallel_interleave` (a short sketch follows below).
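
A minimal sketch of that variant, assuming a TF 1.x release where `tf.contrib.data.parallel_interleave` is available (it is applied via `Dataset.apply`); the `cycle_length` and `sloppy` values are placeholders you would tune:

# drop-in replacement for the interleave() line above
dataset = dataset.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=4,   # how many files are read concurrently
        sloppy=True))     # relax the round-robin order for extra randomness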

GPhilo
  • Will this shuffle my files or the data inside the files? – Pekka Dec 05 '17 at 10:47
  • It will shuffle the TFRecords extracted from the files – GPhilo Dec 05 '17 at 11:46
  • OK, but what do you do when you have a long series of TFRecord files (with a total of more than 50000 examples) containing the same label (for deep learning) and then another series of files containing examples with another label? For shuffling to work, you would need a buffer larger than 50000, and thus a lot of RAM. This is not a solution. Shuffling filenames is a much easier solution. – ma3oun Feb 01 '18 at 09:48
  • If you have a lot of TFRecord files, each with one sample inside, you're building your TFRecord files wrong. The whole point of TFRecord files is to have one big file that acts as a sample container and can be stored efficiently, while at the same time allowing fast extraction of samples. Of course, if you have 50k files it's faster to shuffle filenames beforehand, but in that case the problem is not the pipeline, it's your usage of TFRecords. – GPhilo Feb 01 '18 at 09:55
  • @GPhilo A common use case for me is, for example, serializing training data (millions/billions of records) from Spark. This data can already be partitioned into, say, 200-500 files. I think it wouldn't make sense to force everything into one big file. In this kind of case, a lot of free (in terms of RAM) shuffling can be achieved by shuffling the files. – Pekka Feb 02 '18 at 11:39
  • I'm not suggesting you should pack everything in one big file, your use-case seems very reasonable to me. The problem I'm pointing out is, if you shuffle just the file names you'll still have the data inside each file read in the same order. I agree that shuffling that too doesn't hurt, but you'll still need a `shuffle()` with a buffer after you decode the samples, unless you're OK with having them always in the same order. – GPhilo Feb 02 '18 at 11:59
  • Yes, we still need a buffer, but the required buffer size is much smaller if we shuffle the files first (see the sketch after this comment thread). – Pekka Feb 04 '18 at 07:00
  • @Pekka I think the edit might be what you are aiming for – GPhilo Feb 09 '18 at 11:46
  • @GPhilo Thanks, I will check it out today or tomorrow before accepting. (I didn't downvote the answer). – Pekka Feb 13 '18 at 10:22
  • Distributing across multiple TFRecord files is also recommended for distributed training IIRC. And when you load a dataset with TF Datasets (https://www.tensorflow.org/datasets) you also have an option for shuffling the files (and a separate option for deterministic/non-deterministic file order). Thank you for the code GPhilo! – grofte Sep 02 '20 at 10:46
  • Glad it's helpful, just keep in mind that this code was written for TF 1.4 (I think, or close to that); the Dataset API has evolved enormously since then, so some things may be implementable in a more efficient way today :) – GPhilo Sep 02 '20 at 10:51
  • From looking around the web it seems that yours is the preferred approach (https://datascience.stackexchange.com/questions/16318/what-is-the-benefit-of-splitting-tfrecord-file-into-shards). I would also set deterministic=False for .interleave() to avoid wasting performance on a feature that would reduce shuffling =) – grofte Sep 02 '20 at 11:10
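
To make the point from the comment thread above concrete, here is a hedged end-to-end sketch (not part of the original answer) that shuffles the filenames first and then uses a much smaller record-level shuffle buffer; `decode_example`, the buffer sizes and the batch size are placeholders:

dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.shuffle(len(filenames))   # cheap: only filenames are held in memory
dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=4)
dataset = dataset.map(decode_example, num_parallel_calls=5)
dataset = dataset.shuffle(1000)             # small record-level buffer suffices once files are shuffled
dataset = dataset.batch(32)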

The current Tensorflow version (v1.5 in 02/2018) does not seem to support filename shuffling natively in the Dataset API. Here is a simple workaround using numpy:

import numpy as np
import tensorflow as tf

myShuffledFileList = np.random.choice(myInputFileList, size=len(myInputFileList), replace=False).tolist()

dataset = tf.data.TFRecordDataset(myShuffledFileList)
ma3oun
  • Loading the file list dynamically: `tf.data.Dataset.list_files('pattern-here').shuffle(BUFFER_SIZE)`. Hardcoding it: `tf.data.Dataset.from_tensor_slices([filenames]).shuffle(BUFFER_SIZE)`. Both must be followed by an appropriate `.map` with a decode function that opens and reads the records in the file. Again, how is that not possible with the current API? Also, if you *really* want to use `numpy`, `np.random.shuffle(myInputFileList)`. – GPhilo Feb 01 '18 at 10:07
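
A runnable sketch of the dynamic approach from the comment above, assuming a TF version where `tf.data.Dataset.list_files` is available; the glob pattern, buffer size and `decode_example` parse function are placeholders, and the records are read here with `flat_map(tf.data.TFRecordDataset)` as in the first answer rather than inside the `.map`:

import tensorflow as tf

BUFFER_SIZE = 100  # large enough to cover the number of files

dataset = tf.data.Dataset.list_files("data/*.tfrecord")   # hypothetical glob pattern
dataset = dataset.shuffle(BUFFER_SIZE)                     # shuffles the filenames, not the records
dataset = dataset.flat_map(tf.data.TFRecordDataset)        # read the records of each file in turn
dataset = dataset.map(decode_example)                      # decode_example: your own parsing function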
    Loading the file list dynamically: `tf.data.Dataset.list_files('pattern-here').shuffle(BUFFER_SIZE)`. Hardcoding it: `tf.data.Dataset.from_tensor_slices([filenames]).shuffle(BUFFER_SIZE)`. Both must be followed by an appropriate `.map` with a decode function that opens and reads the records in the file. Again, how is that not possible with the current API? Also, if you *really* want to use `numpy`, `np.random.shuffle(myInputFileList)`. – GPhilo Feb 01 '18 at 10:07