10

Perhaps this question has been asked before, but I'm having trouble finding relevant info for my situation.

I'm using PyTorch to create a CNN for regression with image data. I don't have a formal, academic programming background, so many of my approaches are ad-hoc and just terribly inefficient. Many times I can go back through my code and clean things up later, because the inefficiency is not so drastic that performance is significantly affected. However, in this case, my method for using the image data takes a long time, uses a lot of memory, and is repeated every time I want to test a change in the model.

What I've done is essentially load the image data into numpy arrays, save those arrays in an .npy file, and then import all of the data from that file whenever I want to use it for the model. I don't think the data set is really THAT large, as it is comprised of 5,000 three-color-channel images of size 64x64. Yet my memory usage shoots up to 70%-80% (out of 16 GB) when it is being loaded, and it takes 20-30 seconds to load in every time.
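
To illustrate, this is roughly the pattern I mean (a simplified sketch; the real preprocessing is omitted and the file name is made up):

import numpy as np

# one-time preprocessing: stack all 5000 images into a single array and dump it
all_images = [np.random.rand(64, 64, 3) for _ in range(5000)]  # placeholder data
images = np.stack(all_images)          # shape (5000, 64, 64, 3), ~0.5 GB as float64
np.save('images.npy', images)

# before every run, the whole array is pulled back into memory at once
images = np.load('images.npy')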

My guess is that I'm being dumb about the way I'm loading it in, but frankly I'm not sure what the standard is. Should I, in some way, put the image data somewhere before I need it, or should the data be loaded directly from the image files? And in either case, what is the best, most efficient way to do that, independent of file structure?

I would really appreciate any help on this.

Doug MacArthur
  • You can save the image data as a hdf5 file. Then load the hdf5 file using h5py once before training. In the training loop you can use this loaded hdf5 file as an iterator to get mini-batch size images each time. It'll be blazing fast. – kmario23 Dec 02 '18 at 02:44
  • @kmario23 Dang, alright that sounds good, when I have a minute to get back to this project this will be the first thing I do, thank you. – Doug MacArthur Dec 03 '18 at 02:01

4 Answers

13

For speed I would advise using HDF5 or LMDB:

Reasons to use LMDB:

LMDB uses memory-mapped files, giving much better I/O performance. Works well with really large datasets. The HDF5 files are always read entirely into memory, so you can’t have any HDF5 file exceed your memory capacity. You can easily split your data into several HDF5 files though (just put several paths to h5 files in your text file). Then again, compared to LMDB’s page caching the I/O performance won’t be nearly as good. [http://deepdish.io/2015/04/28/creating-lmdb-in-python/]

If you decide to use LMDB:

ml-pyxis is a tool for creating and reading deep learning datasets using LMDBs. (I am a co-author of this tool.)

It allows you to create binary blobs (LMDB), and they can be read quite fast. The link above comes with some simple examples of how to create and read the data, including Python generators/iterators.

This notebook has an example of how to create a dataset and read it in parallel while using PyTorch.
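
For illustration, here is a minimal sketch of the same idea using the plain lmdb package directly rather than ml-pyxis; the file name, key scheme, and image shape are just assumptions:

import lmdb
import numpy as np

# write: one key per sample, value is the raw image bytes (assumed float32, 3x64x64)
env = lmdb.open('images.lmdb', map_size=int(1e9))  # map_size = max DB size in bytes
with env.begin(write=True) as txn:
    for i in range(5000):
        img = np.random.rand(3, 64, 64).astype(np.float32)  # placeholder image
        txn.put(f'{i:08d}'.encode(), img.tobytes())

# read: the file is memory-mapped, so only the requested sample is actually touched
with env.begin() as txn:
    img = np.frombuffer(txn.get(b'00000042'), dtype=np.float32).reshape(3, 64, 64)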

If you decide to use HDF5:

PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data.

https://www.pytables.org/
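
As a rough sketch (not taken from the PyTables docs), writing images to an extendable array and reading slices back could look like this; the dataset name and shapes are assumptions:

import numpy as np
import tables

# write: create an extendable array and append images in chunks
with tables.open_file('train_images.h5', mode='w') as f:
    images = f.create_earray(f.root, 'images',
                             atom=tables.Float32Atom(),
                             shape=(0, 3, 64, 64))  # first axis is extendable
    for _ in range(50):
        batch = np.random.rand(100, 3, 64, 64).astype(np.float32)  # placeholder batch
        images.append(batch)

# read: slicing only pulls the requested rows from disk, not the whole file
with tables.open_file('train_images.h5', mode='r') as f:
    sample = f.root.images[42]       # one image
    minibatch = f.root.images[0:32]  # a slice of 32 images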

OddNorg
  • I can confirm, it's very fast for reads. I have a question if you happen to see this: the storage footprint is much larger than I expected for a serialized data store. Something that is 2 MB as CSV and 0.6 MB as HDF5 comes out at 10 MB. Do you know why that is? – LeanMan Oct 31 '20 at 07:00
  • @LeanMan Maybe it's because of how the data is cast? Is it using 32-bit floating point? – OddNorg Nov 04 '20 at 15:55
  • Amazing! May I ask does it support PyTorch as well? – Tengerye Sep 28 '21 at 02:23
  • Yes @Tengerye , there is a branch with a multithread torch loader. – OddNorg Sep 29 '21 at 15:25
  • @LeanMan btw did you mean HDF5 or LMDB when you said **I can confirm, it's very fast reads**? – avocado May 01 '22 at 09:28
  • @avocado Judging from my notes, LMDB. Although I found CSV to be better than the overhead the two brought. – LeanMan May 01 '22 at 18:30
7

Here is a concrete example to demonstrate what I meant. This assumes that you've already dumped the images into an hdf5 file (train_images.hdf5) using h5py.

import h5py
hf = h5py.File('train_images.hdf5', 'r')

group_key = list(hf.keys())[0]
ds = hf[group_key]

# load only one example
x = ds[0]

# load a subset, slice (n examples) 
arr = ds[:n]

# should load the whole dataset into memory.
# this should be avoided
arr = ds[:]

In simple terms, ds can now be used as an iterator which yields images on the fly (i.e. it doesn't load the whole dataset into memory). This should make the whole run time blazing fast.

for idx, img in enumerate(ds):
    # do something with `img`, e.g. feed it to the model
    pass
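
If you want to plug this into a standard PyTorch training loop, one possible sketch of wrapping the HDF5 dataset in a torch.utils.data.Dataset is below; the 'images' key and single-process loading (num_workers=0) are assumptions, and with multiple workers the file should be opened lazily inside each worker instead:

import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class H5ImageDataset(Dataset):
    """Reads one image at a time from an HDF5 dataset."""
    def __init__(self, path, key='images'):
        self.ds = h5py.File(path, 'r')[key]

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, idx):
        # only this one image is read from disk
        return torch.from_numpy(self.ds[idx])

dataset = H5ImageDataset('train_images.hdf5', key='images')
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=0)

for batch in loader:
    pass  # feed `batch` to the model
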
kmario23
2

In addition to the above answers, the following may be useful due to some recent advances (2020) in the PyTorch world.

Your question: Should I, in some way, put the image data somewhere before I need it, or should the data be loaded directly from the image files? And in either case, what is the best, most efficient way to do that, independent of file structure?

You can leave the image files in their original format (.jpg, .png, etc.) on your local disk or on cloud storage, but with one added step: compress the directory into a tar file. Please read this for more details:

PyTorch Blog (Aug 2020): Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs (https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/)

The package described there (WebDataset) is designed for situations where the data files are too large to fit in memory for training. You give it the URL of the dataset location (local, cloud, ..) and it will bring in the data in batches and in parallel.

The only (current) requirement is that the dataset must be in a tar file format.

The tar file can be on the local disk or on the cloud. With this, you don't have to load the entire dataset into the memory every time. You can use the torch.utils.data.DataLoader to load in batches for stochastic gradient descent.
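
As a hedged sketch of what this can look like with the WebDataset library from the blog post linked above (the tar file name and the 'jpg'/'cls' field names are assumptions about how the samples were keyed when the tar was built):

import torch
import webdataset as wds

# each sample in the tar is assumed to be a pair of files such as
# sample0001.jpg (image) and sample0001.cls (label)
url = 'train-images.tar'  # local path or http/s3 URL
dataset = (
    wds.WebDataset(url)
    .decode('torchrgb')        # decode images straight to CHW float tensors
    .to_tuple('jpg', 'cls')    # pick out the image and label fields
)

loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4)
for images, labels in loader:
    pass  # training step here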

Kiran A
0

There is no need to save the images into .npy files and load everything into memory. Just load a batch of image paths and transform them into tensors.

The following code defines MassiveDataset and passes it into a DataLoader; everything works well.

from torch.utils.data.dataset import Dataset
from typing import Optional, Callable
import os
import multiprocessing

def apply_transform(transform: Callable, data):
    try:
        if isinstance(data, (list, tuple)):
            return [transform(item) for item in data]

        return transform(data)
    except Exception as e:
        raise RuntimeError(f'applying transform {transform}: {e}')


class MassiveDataset(Dataset):
    def __init__(self, filename, transform: Optional[Callable] = None):
        self.offset = []
        self.n_data = 0

        if not os.path.exists(filename):
            raise ValueError(f'filename does not exist: {filename}')

        with open(filename, 'rb') as fp:
            # record the byte offset of the start of every line
            self.offset = [0]
            while fp.readline():
                self.offset.append(fp.tell())
            self.offset = self.offset[:-1]  # drop the trailing EOF offset

        self.n_data = len(self.offset)

        self.filename = filename
        # unbuffered handle shared across workers; guarded by the lock below
        self.fd = open(filename, 'rb', buffering=0)
        self.lock = multiprocessing.Lock()

        self.transform = transform

    def __len__(self):
        return self.n_data

    def __getitem__(self, index: int):
        if index < 0:
            index = self.n_data + index
        
        with self.lock:
            # guard seek+read so concurrent reads don't interleave
            self.fd.seek(self.offset[index])
            line = self.fd.readline()

        data = line.decode('utf-8').strip('\n')

        return apply_transform(self.transform, data) if self.transform is not None else data
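
As a hedged usage sketch of the "pass it into DataLoader" part (the image-path file and the transform below are hypothetical, not part of the code above):

from PIL import Image
import torchvision.transforms as T
from torch.utils.data import DataLoader

# hypothetical setup: image_paths.txt contains one image path per line
to_tensor = T.Compose([T.Resize((64, 64)), T.ToTensor()])

def load_image(path: str):
    # `path` is one decoded line from the file; the image is opened lazily here
    return to_tensor(Image.open(path).convert('RGB'))

dataset = MassiveDataset('image_paths.txt', transform=load_image)
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=0)  # for num_workers > 0, see the note below

for batch in loader:
    pass  # feed `batch` to the model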

NB: opening the file with buffering=0 and using multiprocessing.Lock() avoids loading bad data (usually a bit from one part of the file and a bit from another part of the file).

Additionally, if you use multiprocessing in the DataLoader, you may get the exception TypeError: cannot serialize '_io.BufferedReader' object. This is caused by the pickle module used by multiprocessing: it cannot serialize _io.BufferedReader, but dill can. Replacing multiprocessing with multiprocess makes things work (it is a fork of multiprocessing with major changes; enhanced serialization is done with dill).

The same thing was discussed in this issue.

Eric
  • If your memory is large enough, mmap could be used to replace `fd`, without needing `multiprocessing.Lock()` – Eric Sep 20 '21 at 12:53