
I'm trying to feed 1D numpy arrays (flattened images) via a generator into an h5py data file in order to create training and validation matrices.

The following code was adapted from a solution (I can't find it now) in which the `data` argument of the `create_dataset` method of h5py's `File` object is supplied by a call to `np.fromiter`, with a generator as one of its arguments.

from scipy.misc import imread
import h5py
import numpy as np
import os

# Creating h5 data file
f = h5py.File('../data.h5', 'w')

# Source directory for image data
src = '/datasets/aic540/train/images/'

# Showing quantity and dimensionality of data
images = os.listdir(src)
ex_img = imread(src + images[0])
flat_img = ex_img.flatten()
print "# of images is {}".format(len(images))
print "image shape is {}".format(ex_img.shape)
print "flattened image shape is {}".format(flat_img.shape)

# Creating generator to feed in data to h5py's `create_dataset` function
gen = (imread(src + i).flatten().astype(np.int8) for i in os.listdir(src))

# Creating h5 dataset
f.create_dataset(name='training',
                 #shape=(59482, 1555200),
                 data=np.fromiter(gen, dtype=np.int8))

Output:

# of images is 59482
image shape is (540, 960, 3)
flattened image shape is (1555200,)
Traceback (most recent call last):
  File "process_images.py", line 30, in <module>
    data=np.fromiter(gen, dtype=np.int8))
ValueError: setting an array element with a sequence.

While searching for this error in this context, I've read that the problem is that `np.fromiter()` needs a list and not a generator (which seems opposed to what the name "fromiter" implies). Wrapping the generator in a `list()` call allows the code to run, but it of course uses up all the memory expanding that list before the call to `create_dataset` is made.

How do I use a generator to feed data into an H5py data file?

If my approach is entirely wrong, what is the correct way to build a very large numpy matrix that doesn't fit in memory -- using H5py or otherwise?

aweeeezy
  • You have to write chunks. `np.fromiter(..., dtype=np.int8)` makes an array - 1d. So even if it could make the array from a generator, it still creates the whole thing in memory before passing it on to the file. – hpaulj Jul 15 '17 at 01:33
  • @hpaulj so similar to the way ali_m suggests in this post? https://stackoverflow.com/questions/34531479/writing-a-large-hdf5-dataset-using-h5py It seems pretty inelegant/convoluted... I tried using the much simpler looking `chunk` attribute of the `create_dataset` function but that's unfortunately not working. – aweeeezy Jul 15 '17 at 01:43

1 Answer


The "setting an array element with a sequence" error comes from what you are trying to feed `fromiter`, not from the generator part.

In Python 3, `range` is generator-like:

In [15]: np.fromiter(range(3),dtype=int)
Out[15]: array([0, 1, 2])
In [16]: np.fromiter((2*x for x in range(3)),dtype=int)
Out[16]: array([0, 2, 4])

But if I start with a 2d array (which imread produces, right?), and create a generator expression as you do:

In [17]: gen = (np.ones((2,3)).flatten().astype(np.int8) for i in range(3))
In [18]: list(gen)
Out[18]: 
[array([1, 1, 1, 1, 1, 1], dtype=int8),
 array([1, 1, 1, 1, 1, 1], dtype=int8),
 array([1, 1, 1, 1, 1, 1], dtype=int8)]

I generate a list of arrays.

In [19]: gen = (np.ones((2,3)).flatten().astype(np.int8) for i in range(3))
In [21]: np.fromiter(gen, np.int8)
...
ValueError: setting an array element with a sequence.

np.fromiter creates a 1d array from an iterator that provides 'numbers' one at a time, not something that dishes out lists or arrays.

In any case, `np.fromiter` creates a full array, not some sort of generator. There's nothing like an array 'generator'.
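For illustration, if you really want `fromiter` to consume a generator of arrays, you have to flatten it into a stream of scalars first, e.g. with `itertools.chain.from_iterable`. A minimal sketch; note this still builds the entire array in memory, so it doesn't solve the original problem:

```python
import itertools

import numpy as np

# A generator of small 1-D arrays, standing in for the flattened images
gen = (np.ones(3, dtype=np.int8) for _ in range(2))

# chain.from_iterable turns it into a stream of scalars, which fromiter accepts
flat = np.fromiter(itertools.chain.from_iterable(gen), dtype=np.int8)
print(flat)   # a single 1-D array of six ones
```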


Even without chunking you can write data to the file by 'row' or other slice.

In [28]: f = h5py.File('test.h5', 'w')
In [29]: data = f.create_dataset(name='test',shape=(100,10))
In [30]: for i in range(100):
    ...:     data[i,:] = np.arange(i,i+10)
    ...:     
In [31]: data
Out[31]: <HDF5 dataset "test": shape (100, 10), type "<f4">

The equivalent in your case is to load an image, reshape it, and write it immediately to the h5py dataset. No need to collect all the images in an array or list.

read 10 rows:

In [33]: data[:10,:]
Out[33]: 
array([[  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.],
       [  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.],
       [  2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.],
       [  3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.],
       [  4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.,  13.],
       [  5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.,  13.,  14.],
       [  6.,   7.,   8.,   9.,  10.,  11.,  12.,  13.,  14.,  15.],
       [  7.,   8.,   9.,  10.,  11.,  12.,  13.,  14.,  15.,  16.],
       [  8.,   9.,  10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.],
       [  9.,  10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.]], dtype=float32)
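Applied to the question's setup, that per-row pattern would look something like the following sketch. The names and sizes are toy stand-ins (synthetic arrays replace `imread` so the example can run anywhere):

```python
import numpy as np
import h5py

# Toy stand-ins for the question's 59482 images of 1555200 pixels each
n_images, n_pixels = 5, 12

with h5py.File('demo.h5', 'w') as f:
    dset = f.create_dataset('training', shape=(n_images, n_pixels), dtype=np.int8)
    for i in range(n_images):
        # In the real code this row would be
        #   imread(src + images[i]).flatten().astype(np.int8)
        row = np.full(n_pixels, i, dtype=np.int8)
        dset[i, :] = row   # only one image is ever held in memory

with h5py.File('demo.h5', 'r') as f:
    print(f['training'].shape)   # (5, 12)
```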

Enabling chunking might help with really large datasets, but I don't have experience in that area.
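For what it's worth, chunking is just one extra keyword to `create_dataset`. A sketch, with the chunk shape arbitrarily chosen as one row per chunk (whether that is a good chunk size for this data is a separate question):

```python
import numpy as np
import h5py

n_images, n_pixels = 4, 8

with h5py.File('chunked.h5', 'w') as f:
    # chunks=(1, n_pixels): each flattened image is stored as its own chunk
    dset = f.create_dataset('training', shape=(n_images, n_pixels),
                            dtype=np.int8, chunks=(1, n_pixels))
    for i in range(n_images):
        dset[i, :] = i

with h5py.File('chunked.h5', 'r') as f:
    print(f['training'].chunks)   # (1, 8)
```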

hpaulj
  • Thank you for the write-up. I managed to get a working solution by emulating what ali_m had suggested in the link I pasted in my comment under my original post. I made a generator function that yields ndarrays of flattened images of shape (chunk_size, len(img_array)), then iteratively resized the h5 dataset and inserted an ndarray for each generated chunk. Your solution is much simpler and probably more appropriate, as the recommended chunk size for h5 is ~1MiB, which is a little less than the size of one image. Though chunking might still be useful if I downsample the images later. – aweeeezy Jul 15 '17 at 05:05
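The approach described in this comment (a resizable dataset grown one block of rows at a time) can be sketched roughly as follows; the sizes are made up for illustration, and synthetic blocks stand in for the generated image chunks:

```python
import numpy as np
import h5py

chunk_rows, n_pixels = 2, 6

with h5py.File('resizable.h5', 'w') as f:
    # maxshape=(None, ...) lets the first axis grow as chunks arrive
    dset = f.create_dataset('training', shape=(0, n_pixels),
                            maxshape=(None, n_pixels), dtype=np.int8)
    for start in range(0, 6, chunk_rows):
        # Stand-in for a generated (chunk_rows, n_pixels) block of images
        block = np.full((chunk_rows, n_pixels), start, dtype=np.int8)
        dset.resize(dset.shape[0] + chunk_rows, axis=0)
        dset[-chunk_rows:, :] = block

with h5py.File('resizable.h5', 'r') as f:
    print(f['training'].shape)   # (6, 6)
```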