
I have an image processing problem with five classes. Each class has approximately 10 million training examples, and each example is a z-scored 25x25 numpy array.

Obviously, I can’t load all the training data into memory, so I have to use fit_generator.

I am also the one who generates and augments these training matrices, but I can't do it in real time inside fit_generator because it would make training too slow.

First, how should I store 50 million 25x25 .npy arrays on disk? What would be the best practice?

Second, should I use a database to store these matrices and query them during training? I don't think SQLite supports multiple threads, and SQL dataset support in TensorFlow is still experimental.

I would love to know if there is a neat way to store these 50 million matrices so that retrieval during training is optimal.

Third, what about using the HDF5 format? Should I switch to PyTorch instead?

0x90
  • Yes, SQLDataset is only available from TensorFlow 2.1, so it is pretty new at the time of writing; the closest solution to what you want is what you already have. EDIT: As per your edited question, yes, HDF5 is also suitable, since your data is big and you may also need parallel I/O. As for the title of the question, you would train a multi-class model, not a binary one, if you have 5 classes. – Timbus Calin Jan 21 '20 at 07:51
  • @TimbusCalin do you have a suggestion for how to restructure the training data into HDF5 file(s)? I am also OK with trying the latest TensorFlow if needed; the only issue is that SQLite is single-threaded AFAIK. – 0x90 Jan 21 '20 at 07:55
  • Not a direct answer to your question, but 50M 25x25 images for training is a lot. Depending on the use case, you might reach satisfactory model performance with just a fraction of the training data. – sdcbr Jan 21 '20 at 13:38
  • @0x90, have a look at this SO Q&A [How can I combine multiple .h5 file?](https://stackoverflow.com/q/58187004/10462884) It shows how to combine multiple CSV files into HDF5; there are 2 answers: 1 using **h5py** and 1 using **pytables**. You can use the same process to combine NPY files, just substitute the method used to read the data. – kcw78 Jan 22 '20 at 06:34
  • @kcw78 thank you, the question is whether combining 10 million files together makes sense and is preferable to an SQL server. – 0x90 Jan 22 '20 at 14:54
  • @0x90 I tried to write a more detailed answer – Victor Deleau Jan 30 '20 at 14:44

2 Answers


How to store np.arrays() on disk?

Storing them in an HDF5 file is a good idea. The basic HDF5 type is the Dataset, which contains a multidimensional array of a homogeneous type. HDF5 Datasets can be assembled into HDF5 Groups, which can in turn contain other Groups, to create more complex structures. Another way is to pickle your numpy arrays, or more abstract dataset objects, directly to disk, but then your files would only be readable from Python, and pickling is also discouraged for security reasons. Finally, if you want to optimize your data format for TensorFlow read/write operations, you can use the TFRecord file format. Saving a numpy array in TFRecord format can be tricky, but thankfully there are scripts around that do it.
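
If you go the HDF5 route, a minimal h5py sketch could look like the following. The file name ("train.h5"), dataset names ("images", "labels") and chunk size are made up for illustration; the point is that chunked datasets let you read a batch-sized slice without pulling everything into memory:

import numpy as np
import h5py

# Toy stand-in for one shard of the real data: 25x25 z-scored images plus labels
n_samples = 1000
images = np.random.randn(n_samples, 25, 25).astype(np.float32)
labels = np.random.randint(0, 5, size=n_samples).astype(np.int8)

with h5py.File("train.h5", "w") as f:
    # Chunking along the sample axis lets you later read one batch without touching the rest
    f.create_dataset("images", data=images, chunks=(256, 25, 25), compression="gzip")
    f.create_dataset("labels", data=labels, chunks=(256,))

# Reading a slice only loads the needed chunks from disk, not the whole dataset
with h5py.File("train.h5", "r") as f:
    batch_images = f["images"][:256]
    batch_labels = f["labels"][:256]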

Should I use a database to store these matrices and to query from them during training?

You could, but then you would be reinventing the wheel. What you need is one or more separate processes running in parallel with your training process, reading the next batch of training observations (prefetching) and applying some transformations to them while the training process works on the previous batch. This way you avoid any I/O and preprocessing delay, and can get significant performance gains.

AI frameworks have developed their own tools for this problem. In PyTorch there is the torch.utils.data.DataLoader class, and there are tutorials showing how to efficiently load HDF5 files with a DataLoader. In TensorFlow you can create an input pipeline with the tf.data.Dataset class. A basic approach is to first open the file(s) (1), read the data from the file(s) into memory (2), then train your model on what is in memory (3). Let's mock a TF Dataset and training loop:

import time
import tensorflow as tf

class MyDataset(tf.data.Dataset):
    def __new__(cls, filename="image_dataset.proto"):
        time.sleep(0.01)  # mock step (1) delay: opening the file
        return tf.data.TFRecordDataset([filename])

def train(dataset, nb_epoch=10):
    start_time = time.perf_counter()
    for epoch_num in range(nb_epoch):
        for sample in dataset:  # where step (2) delay takes place
            time.sleep(0.01)  # mock step (3) delay: the training step itself
        tf.print("Execution time:", time.perf_counter() - start_time)

You can just apply steps (1, 2, 3) sequentially:

train(MyDataset())

A better way is to read the next batch of data while the training process is still working on the previous batch, so that steps (2, 3) happen in parallel. Applying transformations to the next batch while still training on the previous one is also possible. To prefetch:

train(MyDataset().prefetch(tf.data.experimental.AUTOTUNE))

Additionally, you can use multiple processes to read your file(s), so that several sequences of steps (1, 2) run in parallel:

train(tf.data.Dataset.range(2).interleave(
    lambda _: MyDataset().prefetch(tf.data.experimental.AUTOTUNE),
    num_parallel_calls=tf.data.experimental.AUTOTUNE))

Learn more in the documentation.
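
To make the TFRecord option above a bit more concrete, here is a rough sketch of writing 25x25 float32 arrays to a TFRecord shard and reading them back through a parsed, batched and prefetched pipeline. The shard name and feature keys are arbitrary choices for the example, not anything imposed by TensorFlow:

import numpy as np
import tensorflow as tf

def serialize_example(image, label):
    # One tf.train.Example per sample: raw image bytes plus an integer class label
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# Write a small shard; in practice you would spread the 50M samples over many shards
with tf.io.TFRecordWriter("images_shard_000.tfrecord") as writer:
    for _ in range(1000):
        image = np.random.randn(25, 25).astype(np.float32)
        label = np.random.randint(0, 5)
        writer.write(serialize_example(image, label))

def parse_example(proto):
    parsed = tf.io.parse_single_example(proto, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.reshape(tf.io.decode_raw(parsed["image"], tf.float32), (25, 25))
    return image, parsed["label"]

dataset = (tf.data.TFRecordDataset(["images_shard_000.tfrecord"])
           .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(256)
           .prefetch(tf.data.experimental.AUTOTUNE))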

Should I switch to PyTorch instead?

Almost everything that PyTorch can do, TensorFlow can do too. TensorFlow has been the most production-ready AI framework for a while, used by Google for their TPUs, although PyTorch is catching up. I would say that PyTorch is more research/development oriented, while TensorFlow is more production oriented. Another difference is how you design your neural networks: PyTorch works by adding layers on top of each other, while in TensorFlow you first design a computational graph that you then run on some input data. People often develop their models in PyTorch and then export them to a TensorFlow format for use in production.
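
For completeness, here is a minimal sketch of the torch.utils.data.Dataset / DataLoader pattern mentioned earlier, assuming the samples sit in an HDF5 file laid out like the h5py example above ("train.h5" with "images" and "labels" datasets, both assumed names):

import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class H5ImageDataset(Dataset):
    def __init__(self, path="train.h5"):
        self.path = path
        self.file = None  # opened lazily: h5py file handles cannot be pickled to workers
        with h5py.File(path, "r") as f:
            self.length = f["images"].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.file is None:  # each DataLoader worker opens its own handle
            self.file = h5py.File(self.path, "r")
        image = torch.from_numpy(self.file["images"][idx])
        label = int(self.file["labels"][idx])
        return image, label

loader = DataLoader(H5ImageDataset(), batch_size=256, shuffle=True, num_workers=4)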

Victor Deleau
  • related: https://stackoverflow.com/questions/49579684/what-is-the-difference-between-dataset-from-tensors-and-dataset-from-tensor-slic – 0x90 Jan 24 '20 at 04:59
  • " Consuming Python generators Another common data source that can easily be ingested as a tf.data.Dataset is the python generator. Caution: While this is a convienient approach it has limited portability and scalibility. It must run in the same python process that created the generator, and is still subject to the Python GIL. " Is there a way to avoid that? (https://www.tensorflow.org/guide/data#consuming_python_generators) – 0x90 Jan 29 '20 at 20:20
  • your calculation isn't right for the number of potential different images, I am afraid: assume a 3x3 binary matrix. It has `2**9` possible matrices (assuming no translation invariance). – 0x90 Jan 29 '20 at 20:27
  • I think you are right, my bad! As for the Python generator, I am no expert. There are better ways to load the data though, especially tf.data.TFRecordDataset – Victor Deleau Jan 29 '20 at 20:35
  • https://stackoverflow.com/questions/47568998/tensorflow-load-data-in-multiple-threads-on-cpu maybe this one can help.. – 0x90 Jan 29 '20 at 20:36

Here is some code I found on Medium (I can't find the original post).

This will help to generate training data on-the-fly in a producer-consumer fashion:

import tensorflow as tf
import numpy as np

from time import sleep

class DataGen():
    counter = 0

    def __init__(self):
        self.gen_num = DataGen.counter
        DataGen.counter += 1

    def py_gen(self, gen_name):
        gen_name = gen_name.decode('utf8') + '_' + str(self.gen_num)
        for num in range(10):
            sleep(0.3)  # simulate the cost of generating/augmenting one sample
            yield '{} yields {}'.format(gen_name, num)

Dataset = tf.data.Dataset
dummy_ds = Dataset.from_tensor_slices(['Gen1', 'Gen2', 'Gen3'])
# Interleave the three generators so their outputs are produced concurrently
dummy_ds = dummy_ds.interleave(lambda x: Dataset.from_generator(DataGen().py_gen, output_types=(tf.string), args=(x,)),
                               cycle_length=5,
                               block_length=2,
                               num_parallel_calls=5)
data_tf = dummy_ds.as_numpy_iterator()
for d in data_tf:
    print(d)

Output:

b'Gen1_0 yields 0'
b'Gen1_0 yields 1'
b'Gen2_0 yields 0'
b'Gen2_0 yields 1'
b'Gen3_0 yields 0'
b'Gen3_0 yields 1'
b'Gen1_0 yields 2'
b'Gen1_0 yields 3'
b'Gen2_0 yields 2'
b'Gen2_0 yields 3'
b'Gen3_0 yields 2'
b'Gen3_0 yields 3'
b'Gen1_0 yields 4'
b'Gen1_0 yields 5'
b'Gen2_0 yields 4'
b'Gen2_0 yields 5'
b'Gen3_0 yields 4'
b'Gen3_0 yields 5'
b'Gen1_0 yields 6'
b'Gen1_0 yields 7'
b'Gen2_0 yields 6'
b'Gen2_0 yields 7'
b'Gen3_0 yields 6'
b'Gen3_0 yields 7'
b'Gen1_0 yields 8'
b'Gen1_0 yields 9'
b'Gen2_0 yields 8'
b'Gen2_0 yields 9'
b'Gen3_0 yields 8'
b'Gen3_0 yields 9'
0x90