
I want to train a neural network. I work with Python (3.6.9) and TensorFlow (2.4.0), and my problem is that my dataset is too big to fit in memory.

A bit of context:

  • My network takes as input a small complex matrix of dimension 64 by 32.
  • My dataset is stored as a very large ".mat" file generated by MATLAB code.
  • In the mat file, the samples are stored in a large cell array.
  • I use the h5py library to open the mat file.

Example of Python code to load a single sample from the file:

import h5py
import numpy as np

f = h5py.File('dataset.mat', 'r')
refs = f['data']                          # array of references, one per sample
sample = f[refs[0]][()].view(np.complex)  # load the first sample

Currently, I load only a small part of the dataset and store it in a TensorFlow dataset (ds = tf.data.Dataset.from_tensor_slices(datas)).
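
For illustration, a minimal sketch of this in-memory approach; the variable N (how many samples actually fit in memory) and the stacking into a single array are assumptions, not part of my real code:

import h5py
import numpy as np
import tensorflow as tf

f = h5py.File('dataset.mat', 'r')
refs = f['data']

# N is a placeholder: only the first N samples are loaded into memory
N = 1000
datas = np.stack([f[refs[i]][()].view(np.complex).astype(np.complex64)
                  for i in range(N)])

ds = tf.data.Dataset.from_tensor_slices(datas)  # one element per sample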

I would like to take advantage of h5py's ability to load each sample individually, so that examples are loaded on the fly during network training.

I tried the following approach:

f = h5py.File('dataset.mat', 'r')
refs = f['data']                         # array of references, one per sample

ds_index = tf.data.Dataset.range(len(refs))
ds = ds_index.map(lambda i: f[refs[i]][()].view(np.complex))

but I get the following error:

NotImplementedError: in user code:

    <ipython-input-66-6cf802c8359a>:15 __call__  *
        return self._f[self._rs[i]]['channel'][()].view(np.complex).astype(np.complex64).T
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py:855 __array__
        " a NumPy call, which is not supported".format(self.name))

    NotImplementedError: Cannot convert a symbolic Tensor (args_0:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported

Do you know how to fix this error, or is there a better way to load examples on the fly?

  • Does this answer your question? [TensorFlow - tf.data.Dataset reading large HDF5 files](https://stackoverflow.com/questions/48309631/tensorflow-tf-data-dataset-reading-large-hdf5-files) – Lescurel Jan 19 '21 at 09:52
  • Thanks for the quick answer, it's an interesting approach, but on a TensorFlow dataset built from a generator I can't do `len(ds)` to recover the dataset size, which makes it complicated to use shuffle, take and skip to split the data into 3 subsets (train, val and test). Do you know a way for a generator to know the size of its dataset? – Romain Negrel Jan 19 '21 at 10:22
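
A minimal sketch of the generator-based approach from the linked question, under the assumption that the dataset size can simply be read with len(refs) before the tf.data datasets are built; the seed, the 80/10/10 split and the (64, 32) output shape are placeholder choices:

import h5py
import numpy as np
import tensorflow as tf

f = h5py.File('dataset.mat', 'r')
refs = f['data']
n_samples = len(refs)            # the size is known from the HDF5 file itself

# Shuffle sample indices up front and split them; the samples themselves
# stay on disk and are only read inside the generator.
rng = np.random.RandomState(seed=0)
indices = rng.permutation(n_samples)
n_train = int(0.8 * n_samples)   # placeholder 80/10/10 split
n_val = int(0.1 * n_samples)
splits = {'train': indices[:n_train],
          'val': indices[n_train:n_train + n_val],
          'test': indices[n_train + n_val:]}

def make_dataset(split_indices):
    def gen():
        for i in split_indices:
            # shape (64, 32) is taken from the question; adjust (e.g. add .T)
            # if the layout stored by MATLAB is transposed
            yield f[refs[i]][()].view(np.complex).astype(np.complex64)
    return tf.data.Dataset.from_generator(
        gen,
        output_signature=tf.TensorSpec(shape=(64, 32), dtype=tf.complex64))

train_ds = make_dataset(splits['train'])
val_ds = make_dataset(splits['val'])
test_ds = make_dataset(splits['test'])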
