
I have many 100x100 px black/white GIF images. I want to use them in NumPy to train a machine learning algorithm, but I would like to save them in a single file that is easily readable in Python/NumPy. By many I mean several hundred thousand, so I would like to take advantage of the fact that the images carry only 1 bit per pixel.

Any idea on how I can do this?

EDIT:

I used a BitArray object from the bitstring module, then saved it using numpy.savez. The problem is that it takes ages to save: I never managed to see the end of the process on the entire dataset. I tried saving a small subset, and it took 10 minutes and produced a file about 20 times the size of the subset itself.
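For reference, a minimal sketch of that approach (a random array stands in for a real image; the BitArray objects end up in an object array that numpy.savez can only store by pickling, which is part of why it is slow and large):

import numpy as np
from bitstring import BitArray

# A random 100x100 black/white image stands in for a real GIF here
img = np.random.rand(100, 100) > 0.5

# One BitArray per image, collected into an object array
bit_arrays = np.empty(1, dtype=object)
bit_arrays[0] = BitArray(img.ravel().tolist())

# savez has to pickle the object array instead of writing a flat binary block
np.savez('dataset_slow.npz', images=bit_arrays)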

I will try with the BoolArray, thanks for the reference.

EDIT (solved):

I solved the problem by using a different approach from those that I found in the questions you linked. I found the numpy.packbits function here: numpy boolean array with 1 bit entries

I'm reporting my code here so it can be useful to others:

import numpy as np
from imageio import imread  # or any other reader that returns a NumPy array

accepted_shape = (100, 100)
images = []

for file_path in gifs:  # 'gifs' is the list of GIF file paths

    img_data = imread(file_path)

    # Skip images that do not have the expected 100x100 shape
    if img_data.shape != accepted_shape:
        continue

    # Threshold at the midpoint between the darkest and brightest pixel,
    # then pack 8 boolean pixels into each byte
    max_value = img_data.max()
    min_value = img_data.min()
    middle_value = (max_value + min_value) // 2
    image = np.packbits((img_data.ravel() > middle_value).astype(np.uint8))

    images.append(image)

images = np.vstack(images)  # one row of packed bytes per image
np.savez_compressed('dataset.npz', shape=accepted_shape, images=images)

This just requires some attention when unpacking, because if the number of bits per image is not a multiple of 8, some zeros are added as padding. This is how I load and unpack the dataset:

data = np.load('dataset.npz')
shape = data['shape']
images = data['images']

nf = np.prod(shape)                # number of bits (pixels) per image
ne = images.shape[0]               # number of images
images = np.unpackbits(images, axis=1)
images = images[:, :nf]            # drop the zero bits added as padding
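If the original 2-D layout is needed (for example to display an image), the unpacked rows can be reshaped back using the stored shape; a small sketch:

# Reshape the flat 0/1 rows back into 100x100 boolean images
images_2d = images.reshape((-1,) + tuple(shape)).astype(bool)
print(images_2d.shape)  # (number_of_images, 100, 100)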
  • Here's a very similar question: http://stackoverflow.com/questions/6694835/efficient-serialization-of-numpy-boolean-arrays/6695272#6695272 – Wolph Dec 23 '14 at 14:16

1 Answer


PyTables seems like a good option here. Something like this might work:

import numpy as np
import tables as tb

nfiles = 100000  # or however many files you have

# Create an on-disk chunked array with room for one 100x100 slice per image
# (older PyTables releases spelled these calls openFile/createCArray)
h5file = tb.open_file('data.h5', mode='w', title="Test Array")
root = h5file.root
x = h5file.create_carray(root, 'x', tb.Float64Atom(), shape=(100, 100, nfiles))
x[:100, :100, 0] = np.random.random(size=(100, 100))  # now put in some data
h5file.close()
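Reading the data back is just a matter of opening the file and slicing the array; a minimal sketch using the same file and node names as above:

import tables as tb

with tb.open_file('data.h5', mode='r') as h5file:
    first_image = h5file.root.x[:, :, 0]  # read a single 100x100 slice
    print(first_image.shape)              # (100, 100)

Since the images are 1-bit, a tb.BoolAtom() or a compressed UInt8Atom() instead of Float64Atom() would keep the file considerably smaller.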