I have 60 GB of data stored as .npy files, spread across 20 files. I want to build a neural net in TensorFlow to learn on this data.

I plan to train on 19 of the files and test on the remaining one. Each file has roughly 80 columns of x data and 1 column of categorical y data. The data types are np.float64 and np.int64. I cannot downcast to smaller dtypes because the rounding would lose valuable information.

I have no trouble loading the data into my neural net when I load a single file, but I am having trouble with training because I need to learn across all of the data. I cannot train on the files in sequential order (for example, files 1-19 in order 1, 2, 3, ..., 19); I need to somehow shuffle all of the data for each epoch.

I've read posts like this one, which looks almost identical to my question. However, my question is different because I need to shuffle across multiple files. I have not seen a question like this answered on Stack Overflow.

Paul Terwilliger
  • From what I see you have a few options: 1) randomly select files (without replacement) from 1-19 to get SOME random shuffling 2) shuffle files beforehand (e.g. a helper function that mixes 2 files `(3, 15)`, `(5, 10)`, ...); stack more shuffling on top of one another to get more shuffling 3) break the dataset down into many more files, e.g. 100 files instead of 20. Any reason you haven't tried this? 4) use TFRecords via `tf.data.TFRecordDataset` (which was alluded to in the linked question). Will this not work? And why not? – IanQ Jan 14 '19 at 21:09
  • Maybe you can use the tf.data pipeline, using [interleave](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#interleave) or [parallel_interleave](https://www.tensorflow.org/api_docs/python/tf/data/experimental/parallel_interleave). Your cycle length could be 20 in this case. You can use [from_generator](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator) to define a generator that shuffles the data within each file and yields shuffled rows, and over this apply interleave with the num_parallel_calls argument, or parallel_interleave with the sloppy argument; a sketch of this approach follows below. – kvish Jan 15 '19 at 05:45
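
A minimal sketch of the pipeline kvish describes (TF 2.x style), assuming each .npy file holds a 2-D float64 array with the 80 feature columns first and the label in the last column; the file names here are hypothetical:

```python
import numpy as np
import tensorflow as tf

train_files = [f"train_{i:02d}.npy" for i in range(1, 20)]  # hypothetical names

def shuffled_rows(path):
    """Load one ~3 GB file, shuffle its rows, and yield (x, y) pairs."""
    path = path.decode("utf-8") if isinstance(path, bytes) else path
    data = np.load(path)
    np.random.shuffle(data)            # in-place row shuffle within this file
    for row in data:
        yield row[:-1], int(row[-1])

def file_dataset(path):
    # `path` is a scalar string tensor; `args` forwards it to the generator.
    return tf.data.Dataset.from_generator(
        shuffled_rows,
        args=(path,),
        output_types=(tf.float64, tf.int64),
        output_shapes=((80,), ()),
    )

# cycle_length=19 pulls from every per-file generator at once, so rows from
# different files are mixed into one stream. Note that this keeps all 19
# files in memory simultaneously; lower cycle_length if RAM is tight.
dataset = (
    tf.data.Dataset.from_tensor_slices(train_files)
    .interleave(file_dataset, cycle_length=19,
                num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=10_000)       # extra cross-file shuffling
    .batch(256)
)
```

Since from_generator re-runs the generator on each pass over the dataset, the within-file order is reshuffled every epoch.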

1 Answer

The post you linked to explains how to get a TFRecordDataset for each of the 19 data files. Rather than tf.data.Dataset.zip (which pairs one element from each dataset into tuples), you can combine them into a single stream by passing all the filenames to one tf.data.TFRecordDataset, or by interleaving the per-file datasets. On this combined dataset you can apply shuffle. See this TensorFlow tutorial for details.
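
A rough sketch of that pipeline, assuming the 19 training files have already been written as TFRecords with the 80 float64 features serialized as raw bytes (Example protos only store float32 natively, so raw bytes is one way to keep full float64 precision); the file names and feature schema here are placeholders:

```python
import tensorflow as tf

filenames = [f"train_{i:02d}.tfrecord" for i in range(1, 20)]  # placeholders

feature_spec = {
    "x": tf.io.FixedLenFeature([], tf.string),  # 80 float64 values as raw bytes
    "y": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(record):
    parsed = tf.io.parse_single_example(record, feature_spec)
    x = tf.io.decode_raw(parsed["x"], tf.float64)  # keeps full precision
    return tf.reshape(x, [80]), parsed["y"]

dataset = (
    tf.data.TFRecordDataset(filenames)   # a single stream over all 19 files
    .shuffle(buffer_size=100_000)        # shuffle across file boundaries
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(256)
)
```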

The way tf.data.Dataset.shuffle works is by maintaining a buffer of buffer_size elements: each output element is drawn at random from the buffer, and its slot is refilled with the next element from the input. If the buffer is much smaller than the dataset, the output is only locally shuffled. I guess you can increase the randomness, if needed, by dividing your 19 files into smaller files, but you will pay in computational efficiency.
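
A tiny illustration of that buffering behaviour (my own example, not from the linked tutorial):

```python
import tensorflow as tf

# With buffer_size=3, element 9 only enters the buffer after 7 elements
# have already been emitted, so it can never appear early in the output:
# a small buffer gives only *local* shuffling.
ds = tf.data.Dataset.range(10).shuffle(buffer_size=3)
print(list(ds.as_numpy_iterator()))  # e.g. [1, 0, 3, 2, 5, 4, 6, 8, 7, 9]
```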

tomkot