
I have a large dataset that I would like to use for training in TensorFlow.

The data is stored in compressed numpy format (using numpy.savez_compressed). There are variable numbers of images per file due to the way they are produced.

Currently I use a Keras Sequence-based generator object to train, but I'd like to move entirely to TensorFlow without Keras.

I'm looking at the Dataset API on the TF website, but it is not obvious how I might use this to read numpy data.

My first idea was this:

import glob
import tensorflow as tf
import numpy as np

def get_data_from_filename(filename):
   npdata = np.load(open(filename))
   return npdata['features'],npdata['labels']

# get files
filelist = glob.glob('*.npz')

# create dataset of filenames
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds.flat_map(get_data_from_filename)

However, this passes a TF Tensor to a real numpy function, and numpy is expecting a standard string. This results in the error:

File "test.py", line 6, in get_data_from_filename
   npdata = np.load(open(filename))
TypeError: coercing to Unicode: need string or buffer, Tensor found

The other option I'm considering (though it seems messy) is to create a Dataset object built on TF placeholders, which I then fill from my numpy files during my epoch-batch loop.
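
Roughly what I have in mind is sketched below (TF 1.x graph mode; the shapes and batch size are made up, not my real data):

features_ph = tf.placeholder(tf.float32, shape=[None, 64, 64, 3])
labels_ph = tf.placeholder(tf.float32, shape=[None])

ds = tf.data.Dataset.from_tensor_slices((features_ph, labels_ph)).batch(32)
iterator = ds.make_initializable_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    for filename in filelist:
        npdata = np.load(filename)
        # Re-initialize the iterator for each file by feeding its arrays.
        sess.run(iterator.initializer,
                 feed_dict={features_ph: npdata['features'],
                            labels_ph: npdata['labels']})
        while True:
            try:
                sess.run(next_batch)  # training step would go here
            except tf.errors.OutOfRangeError:
                break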

Any suggestions?

Taylor Childers
  • your filename would be a tensor, which you are trying to open using numpy, which is why the error is thrown. You might need to use the [py_func](https://www.tensorflow.org/api_docs/python/tf/py_func) method to read your data this way. – kvish Nov 29 '18 at 18:11
  • How might that work? I'm playing around with `py_func` but I can't get the inputs/outputs correct. My function takes a filename string as input, and outputs two numpy arrays. If I use `ds.flat_map(get_data_from_filename,[tf.string],[tf.float32,tf.float32])` I get the error `Tensors in list passed to 'input' of 'PyFunc' Op have types [] that are invalid.` and I'm not entirely sure how to correctly use this function in this context. – Taylor Childers Nov 29 '18 at 18:56
  • I have added an answer as a reference for code! Let me know if that helps :) – kvish Nov 29 '18 at 19:10

1 Answer


You can define a wrapper and use `tf.py_func` like this:

def get_data_from_filename(filename):
    npdata = np.load(filename)
    return npdata['features'], npdata['labels']

def get_data_wrapper(filename):
    # Assuming here that both your data and labels are float type.
    features, labels = tf.py_func(
        get_data_from_filename, [filename], (tf.float32, tf.float32))
    return tf.data.Dataset.from_tensor_slices((features, labels))

# Create dataset of filenames.
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds = ds.flat_map(get_data_wrapper)
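
To check the pipeline end to end, you can pull batches through an iterator in a session (a minimal TF 1.x sketch; the batch size is arbitrary):

# Consume the dataset through a one-shot iterator (graph mode).
batch = ds.batch(32).make_one_shot_iterator().get_next()

with tf.Session() as sess:
    features_batch, labels_batch = sess.run(batch)
    print(features_batch.shape, labels_batch.shape)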

If your dataset is very large and you run into memory issues, you can consider using from_generator in combination with interleave or parallel_interleave instead. The from_generator method uses py_func internally, so you can read your npz file directly and define your generator in Python; a rough sketch follows.
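
Something along these lines (a sketch only, assuming a TF version whose from_generator accepts an args argument; the dtypes and cycle_length are assumptions):

def npz_generator(filename):
    # from_generator passes the filename through as bytes.
    if isinstance(filename, bytes):
        filename = filename.decode()
    npdata = np.load(filename)
    for features, labels in zip(npdata['features'], npdata['labels']):
        yield features, labels

ds = tf.data.Dataset.from_tensor_slices(filelist)
ds = ds.interleave(
    lambda f: tf.data.Dataset.from_generator(
        npz_generator, (tf.float32, tf.float32), args=(f,)),
    cycle_length=4)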

kvish
  • I got an error from `flat_map`: `TypeError: must return a 'Dataset' object.` So I changed `get_data_wrapper` to return `tf.data.Dataset.from_tensor_slices((features, labels))`, and that does not throw an error. – Taylor Childers Nov 29 '18 at 19:27
  • @Taylor yes sorry I missed that! I will update the code to reflect it should be a dataset – kvish Nov 29 '18 at 19:31
  • Just to verify: if I evaluate this using an iterator in a session, I get the proper output. You mention using `from_generator` in case I have a large dataset. It was my understanding that this method would load files on the fly as needed, but your comment implies it will try opening many at once. Is that so? – Taylor Childers Nov 29 '18 at 19:31
  • @Taylor the method would be using interleave with from_generator. The from_generator would be for opening a single npz file, and yielding data one by one or in batches, however it suits you. The interleave method would open files concurrently, and interleave the results in an order. If you check the documentation that I had shared in the answer, they give you an idea of how to read 4 files concurrently for example. You would map the x to a generator in this case instead of a TextLineDataset as shown in that example – kvish Nov 29 '18 at 19:36
  • Thanks, it appears `parallel_interleave` is deprecated in favor of simply `interleave`. – Taylor Childers Nov 29 '18 at 19:42
  • Funny enough, the behavior of interleave with from_generator is quite frustrating. Even with parallel calls, it always opens/decompresses the files in all parallel threads at the same time, which means it's not really more efficient than serial calls. – Taylor Childers Nov 29 '18 at 20:44
  • @Taylor yeah, I think it was just designed to provide an avenue to use generators and assemble data when there are very many data points per file, for instance when you might not want to load all the data and shuffle it. You could have it randomly split across files and then load it in that order, akin to shuffling. – kvish Nov 29 '18 at 20:57