
I have a training dataset that is too big to fit into memory, so my code reads only 1,000 records from disk at a time. Now I would like to use TensorFlow's new Dataset API. Does the Dataset API allow me to specify the number of records to keep in memory, or does TensorFlow automatically manage memory so that I don't have to?

Mr_and_Mrs_D
user554481

2 Answers


Yes. Here is an example from the official guide, "Using the Dataset API for TensorFlow Input Pipelines" (https://www.tensorflow.org/programmers_guide/datasets):

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.contrib.data.TFRecordDataset(filenames)
dataset = dataset.map(...)                    # parse each record with a user-specified function
dataset = dataset.shuffle(buffer_size=10000)  # 10,000: size of the record pool sampled from at random
dataset = dataset.repeat()                    # no argument: repeat indefinitely
dataset = dataset.batch(32)                   # 32: number of records per batch (read into memory together)
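To see why this pipeline never needs the whole file in memory, here is a plain-Python sketch (not TensorFlow code; `record_stream` is a hypothetical lazy iterator over records) of what lazy batching looks like: only one batch's worth of records is materialized at a time, which mirrors how `dataset.batch(32)` pulls 32 elements per training step.

```python
def batches(record_stream, batch_size):
    """Group a lazily-produced stream of records into fixed-size batches.

    At most batch_size records are held in memory at once, mirroring
    how dataset.batch(32) pulls 32 elements per step."""
    batch = []
    for record in record_stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch

# 100 records in batches of 32 -> batch sizes 32, 32, 32, 4
sizes = [len(b) for b in batches(iter(range(100)), 32)]
```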
Maosi Chen

Yes, you specify the number of records via batch_size, and TF will grab only batch_size elements from the file at a time. If you also specify shuffle, this guarantees that at most buffer_size elements will be in memory at any time.

I verified this on my tfrecords files. I have 100 tfrecords files, each of them ~10 GB (which is more than the memory on my laptop), and everything works fine.
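The bounded-memory claim about shuffle can be illustrated with a plain-Python sketch. This is not TensorFlow's actual implementation, just the buffered-shuffle idea behind `Dataset.shuffle(buffer_size=...)`: keep a pool of at most buffer_size elements, emit a random one, refill from the stream, so memory use never depends on the total dataset size.

```python
import random

def buffered_shuffle(record_stream, buffer_size, seed=0):
    """Approximately shuffle a stream, keeping at most buffer_size
    elements in memory: fill a pool, emit a random element from it,
    refill from the stream, then drain the remainder at the end."""
    rng = random.Random(seed)
    pool = []
    for record in record_stream:
        pool.append(record)
        if len(pool) >= buffer_size:
            i = rng.randrange(len(pool))
            yield pool.pop(i)
    rng.shuffle(pool)  # drain the leftover pool
    yield from pool

# Every input record comes out exactly once, in a (locally) shuffled order.
shuffled = list(buffered_shuffle(range(100), buffer_size=10))
```

Note the trade-off this makes visible: a small buffer_size bounds memory tightly but only shuffles within a sliding window, so records far apart in the file can never swap order.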

Salvador Dali
  • Unless I've overlooked something, the docs simply say that batch "combines consecutive elements of this dataset into batches." I didn't see anything about memory management or how the data is read from disk. Also, what I've seen so far suggests that Datasets is replacing queues; for example, see this issue: https://github.com/tensorflow/tensorflow/issues/7951#issuecomment-303744600 – user554481 Jul 16 '17 at 05:05
  • @user554481 wow, didn't know that. This is the first time I saw dataset and thought that this is just a helper. Thank you – Salvador Dali Jul 16 '17 at 05:34
  • Interesting... it is usually suggested that one should only use O(100MB) for one's tfrecord files... – ArtificiallyIntelligence Dec 15 '21 at 21:35