I came up with a custom sparse data format. It is meant for space-efficient storage and loading, not for doing computations on it. The essence is to store the indices and values of the nonzero entries. I was wondering if there are some tweaks that could improve the performance.
The need emerged from handling data like this: N "images" (32x32) with four channels each, containing on average ~5% nonzero values. As N grows very large, storing all the images densely in RAM becomes wasteful. So only the number of nonzero entries per image, their flat indices and values, as well as the original shape are stored.
Here's an example of how this can be implemented:
import numpy as np

def disassemble_data(data):
    # take dense data and store only its nonzero entries
    lengths = np.count_nonzero(data, axis=(1, 2, 3))
    idxs = np.flatnonzero(data)
    vals = data.ravel()[idxs]
    return lengths, idxs, vals, data.shape

def assemble_data(lengths, idxs, vals, shape):
    # take sparse data and make it dense again
    data = np.zeros(shape)
    lower_idx = 0
    for length in lengths:
        upper_idx = lower_idx + length
        data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
        lower_idx = upper_idx
    return data

# create some dummy data with ~5% nonzero entries
my_data = np.random.uniform(0, 1, (10, 4, 32, 32))
my_data[my_data > 0.05] = 0

# make the data sparse and then dense again
my_new_data = assemble_data(*disassemble_data(my_data))

# assert that the round trip actually works
assert np.allclose(my_data, my_new_data)
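To see what the format actually buys us, one can compare the footprint of the dense array with the combined footprint of the sparse components. This is a rough sketch on the dummy data from above, not a benchmark; at ~5% density and float64 values the sparse form should come out well under half the dense size:

```python
import numpy as np

def disassemble_data(data):
    # store only the nonzero entries of a dense array
    lengths = np.count_nonzero(data, axis=(1, 2, 3))
    idxs = np.flatnonzero(data)
    vals = data.ravel()[idxs]
    return lengths, idxs, vals, data.shape

# dummy data with ~5% nonzero entries
my_data = np.random.uniform(0, 1, (10, 4, 32, 32))
my_data[my_data > 0.05] = 0

lengths, idxs, vals, shape = disassemble_data(my_data)

dense_bytes = my_data.nbytes
sparse_bytes = lengths.nbytes + idxs.nbytes + vals.nbytes
print(dense_bytes, sparse_bytes)
```

Each stored nonzero costs one int64 index plus one float64 value (16 bytes), so the break-even point is at 50% density; below that the sparse form wins.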
Now we can directly see the advantage: the data is densified image by image. This allows us to keep the whole dataset in RAM in sparse form and generate dense images on demand via a generator:
def image_generator(lengths, idxs, vals, shape):
    # convert global flat indices to per-image flat indices;
    # use a copy so the caller's idxs array is not modified in place
    idxs = idxs % np.prod(shape[1:])
    lower_idx = 0
    for length in lengths:
        upper_idx = lower_idx + length
        data = np.zeros(shape[1:])
        data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
        lower_idx = upper_idx
        yield data
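A quick way to sanity-check the generator is to compare every yielded image against the corresponding slice of the original dense array (usage sketch, reusing disassemble_data from above):

```python
import numpy as np

def disassemble_data(data):
    lengths = np.count_nonzero(data, axis=(1, 2, 3))
    idxs = np.flatnonzero(data)
    vals = data.ravel()[idxs]
    return lengths, idxs, vals, data.shape

def image_generator(lengths, idxs, vals, shape):
    idxs = idxs % np.prod(shape[1:])  # per-image flat indices (copy)
    lower_idx = 0
    for length in lengths:
        upper_idx = lower_idx + length
        data = np.zeros(shape[1:])
        data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
        lower_idx = upper_idx
        yield data

my_data = np.random.uniform(0, 1, (10, 4, 32, 32))
my_data[my_data > 0.05] = 0

images = list(image_generator(*disassemble_data(my_data)))
assert all(np.allclose(img, my_data[i]) for i, img in enumerate(images))
```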
Furthermore, it is possible to generate batches of images:
def image_batch_generator(lengths, idxs, vals, shape, batch_size):
    # note: this assumes the number of images is divisible by batch_size;
    # convert global flat indices to per-batch flat indices (again via a copy)
    idxs = idxs % np.prod((batch_size, *shape[1:]))
    lengths = np.sum(lengths.reshape(-1, batch_size), axis=1)
    lower_idx = 0
    for length in lengths:
        upper_idx = lower_idx + length
        data = np.zeros((batch_size, *shape[1:]))
        data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
        lower_idx = upper_idx
        yield data
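The batch generator can be checked the same way; concatenating all batches should reproduce the original dense array (usage sketch, assuming the image count is divisible by batch_size):

```python
import numpy as np

def disassemble_data(data):
    lengths = np.count_nonzero(data, axis=(1, 2, 3))
    idxs = np.flatnonzero(data)
    vals = data.ravel()[idxs]
    return lengths, idxs, vals, data.shape

def image_batch_generator(lengths, idxs, vals, shape, batch_size):
    idxs = idxs % np.prod((batch_size, *shape[1:]))  # per-batch flat indices (copy)
    lengths = np.sum(lengths.reshape(-1, batch_size), axis=1)
    lower_idx = 0
    for length in lengths:
        upper_idx = lower_idx + length
        data = np.zeros((batch_size, *shape[1:]))
        data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
        lower_idx = upper_idx
        yield data

my_data = np.random.uniform(0, 1, (10, 4, 32, 32))
my_data[my_data > 0.05] = 0

# 10 images, batch_size 5 -> 2 batches
batches = list(image_batch_generator(*disassemble_data(my_data), batch_size=5))
assert np.allclose(np.concatenate(batches), my_data)
```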
This is a rather convenient approach for my needs, yet I was wondering whether it can be sped up.
For example, I saw that NumPy's itemset is faster than direct assignment (according to the docs), but it only works for a single item, not for index arrays.
Are there any other approaches? I'm not at all familiar with Cython and the like, so I'd be happy to get some hints!
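One micro-optimization that might help (a hedged sketch, not benchmarked here): np.zeros allocates and clears a fresh buffer on every iteration, so the generator could instead reuse a single buffer and reset only the entries it wrote in the previous iteration, which is cheap at ~5% density. The function name image_generator_reuse is my own; note that the consumer must copy the yielded array if it needs to keep it:

```python
import numpy as np

def disassemble_data(data):
    lengths = np.count_nonzero(data, axis=(1, 2, 3))
    idxs = np.flatnonzero(data)
    vals = data.ravel()[idxs]
    return lengths, idxs, vals, data.shape

def image_generator_reuse(lengths, idxs, vals, shape):
    # hypothetical variant: one shared buffer, sparse reset between yields
    idxs = idxs % np.prod(shape[1:])  # per-image flat indices (copy)
    buf = np.zeros(shape[1:])
    flat = buf.ravel()  # view into buf, so writes to flat show up in buf
    lower = 0
    for length in lengths:
        upper = lower + length
        flat[idxs[lower:upper]] = vals[lower:upper]
        yield buf  # caller must copy if it keeps the image
        flat[idxs[lower:upper]] = 0  # clear only the ~5% touched entries
        lower = upper

my_data = np.random.uniform(0, 1, (10, 4, 32, 32))
my_data[my_data > 0.05] = 0

# copy each yielded image, since the buffer is overwritten on the next step
images = [img.copy() for img in image_generator_reuse(*disassemble_data(my_data))]
assert all(np.allclose(img, my_data[i]) for i, img in enumerate(images))
```

Whether this wins in practice depends on image size and density, so it is worth timing against the original before adopting it.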