I'm trying to feed 1D numpy arrays (flattened images) via a generator into an h5py data file in order to create training and validation matrices.
The following code was adapted from a solution (can't find it now) in which the data argument of the create_dataset method of h5py's File object is given the result of a call to np.fromiter, which takes a generator as one of its arguments.
from scipy.misc import imread
import h5py
import numpy as np
import os
# Creating h5 data file
f = h5py.File('../data.h5', 'w')
# Source directory for image data
src = '/datasets/aic540/train/images/'
# Showing quantity and dimensionality of data
images = os.listdir(src)
ex_img = imread(src + images[0])
flat_img = ex_img.flatten()
print "# of images is {}".format(len(images))
print "image shape is {}".format(ex_img.shape)
print "flattened image shape is {}".format(flat_img.shape)
# Creating generator to feed in data to h5py's `create_dataset` function
gen = (imread(src + i).flatten().astype(np.int8) for i in os.listdir(src))
# Creating h5 dataset
f.create_dataset(name='training',
                 #shape=(59482, 1555200),
                 data=np.fromiter(gen, dtype=np.int8))
Output:
# of images is 59482
image shape is (540, 960, 3)
flattened image shape is (1555200,)
Traceback (most recent call last):
  File "process_images.py", line 30, in <module>
    data=np.fromiter(gen, dtype=np.int8))
ValueError: setting an array element with a sequence.
While searching for this error in this context, I've read that the problem is that np.fromiter() needs a list and not a generator (which seems at odds with what the name "fromiter" implies). Wrapping the generator in a list call, list(gen), allows the code to run but, of course, it uses up all the memory expanding the list before the call to create_dataset is made.
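For context, the error seems to be reproducible without any images. As far as I can tell, np.fromiter builds a 1D array and expects every item it receives to be a single scalar, so an iterable that yields whole arrays raises the ValueError shown above. The sketch below uses small made-up arrays in place of the flattened images; fake_images is a hypothetical stand-in, and uint8 is my assumption for the pixel type. Chaining the arrays into one flat stream of scalars does work, but the result still has to fit in memory, so it does not solve the real problem.

```python
import itertools

import numpy as np


def fake_images(n_images=3, n_pixels=4):
    """Hypothetical stand-in for the image generator: yields 1D uint8 arrays."""
    for i in range(n_images):
        yield np.full(n_pixels, i, dtype=np.uint8)


# Each item is a whole array, not a scalar, so fromiter rejects it.
try:
    np.fromiter(fake_images(), dtype=np.uint8)
    failed = False
except ValueError as err:
    failed = True
    print("fromiter failed:", err)

# Flattening the stream so that each item is one scalar is accepted,
# but the full result is still built in memory.
flat = np.fromiter(itertools.chain.from_iterable(fake_images()), dtype=np.uint8)
stacked = flat.reshape(3, 4)  # 3 images x 4 pixels
print(stacked.shape)
```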
How do I use a generator to feed data into an H5py data file?
If my approach is entirely wrong, what is the correct way to build a very large numpy matrix that doesn't fit in memory, using h5py or otherwise?
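To make the question concrete, this is roughly the pattern I am hoping is possible: create the dataset with its final shape up front, then write one row per image so that only a single flattened image is in memory at a time. It is only a sketch under assumptions. fake_images and fill_dataset are hypothetical names, the tiny shapes stand in for (59482, 1555200), the in-memory "core" driver is used just to avoid touching the disk, and I use uint8 rather than int8 because, as I understand it, pixel values above 127 would not fit in int8.

```python
import h5py
import numpy as np


def fake_images(n_images=5, n_pixels=12):
    """Hypothetical stand-in for reading and flattening one image at a time."""
    for i in range(n_images):
        yield np.full(n_pixels, i, dtype=np.uint8)


def fill_dataset(h5file, name, rows, n_rows, n_cols, dtype=np.uint8):
    """Preallocate an (n_rows, n_cols) dataset, then write it row by row."""
    dset = h5file.create_dataset(name, shape=(n_rows, n_cols), dtype=dtype)
    for i, row in enumerate(rows):
        dset[i, :] = row  # only this one row is held in memory
    return dset


# In-memory HDF5 file so the sketch leaves nothing on disk; a real run
# would open an actual path (e.g. '../data.h5') without these driver options.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    dset = fill_dataset(f, "training", fake_images(), n_rows=5, n_cols=12)
    result = dset[...]  # read back only because the sketch is tiny

print(result.shape)
```

With the real data, the generator from my code above would take the place of fake_images, and n_rows and n_cols would come from len(images) and flat_img.shape[0].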