
I came up with a custom sparse data format. It's meant for space-efficient storage and loading, not for doing computations on it. The essence is to store the indices and values of the nonzero entries. I was wondering if there are some tweaks that could improve the performance.

The need emerged from handling data like this: N "images" (32x32) with four channels each. The images contain on average ~5% nonzero values. As N grows very large, storing all the images densely in RAM is inefficient. So only the number of nonzero entries per image, their indices and values, as well as the original shape, are stored.

Here's an example of how this can be implemented:

import numpy as np


def disassemble_data(data):
    # get some dense data and make it sparse:
    # number of nonzero entries per image, their flat indices into the
    # full array, the corresponding values, and the original shape
    lengths = np.count_nonzero(data, axis=(1, 2, 3))
    idxs = np.flatnonzero(data)
    vals = data.ravel()[idxs]

    return lengths, idxs, vals, data.shape


def assemble_data(lengths, idxs, vals, shape):
    # get some sparse data and make it dense
    data = np.zeros(shape)

    lower_idx = 0
    for length in lengths:
        upper_idx = lower_idx + length
        # data.ravel() is a view here, so this writes into data
        data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
        lower_idx = upper_idx
    return data


# Create some dummy data
my_data = np.random.uniform(0, 1, (10, 4, 32, 32))
# zero out everything above 0.05, leaving ~5% nonzero entries
my_data[my_data > 0.05] = 0

# make data sparse and then dense again
my_new_data = assemble_data(*disassemble_data(my_data))

# assert that this actually works
assert np.allclose(my_data, my_new_data)
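
Since the goal is space-efficient storage and loading, the sparse representation can be written to a single file, e.g. with np.savez_compressed. A minimal sketch (the file name is just illustrative):

lengths, idxs, vals, shape = disassemble_data(my_data)

# one .npz file holding the whole sparse dataset
np.savez_compressed('sparse_images.npz', lengths=lengths, idxs=idxs, vals=vals, shape=shape)

# load it back and rebuild the dense array
with np.load('sparse_images.npz') as f:
    restored = assemble_data(f['lengths'], f['idxs'], f['vals'], tuple(f['shape']))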

Now we can see the advantage directly: the data is densified image by image. This allows us to keep the whole sparse dataset in RAM and generate dense images on demand via a generator:

def image_generator(lengths, idxs, vals, shape):
    # convert global flat indices into per-image flat indices (modifies idxs in place)
    idxs %= np.prod(shape[1:])

    lower_idx = 0
    for length in lengths:
        upper_idx = lower_idx + length
        data = np.zeros(shape[1:])
        data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
        lower_idx = upper_idx
        yield data

Furthermore, it is also possible to generate batches of images:

def image_batch_generator(lengths, idxs, vals, shape, batch_size):
    # convert global flat indices into per-batch flat indices (modifies idxs in place);
    # assumes the number of images is divisible by batch_size
    idxs %= np.prod((batch_size, *shape[1:]))
    lengths = np.sum(lengths.reshape(-1, batch_size), axis=1)

    lower_idx = 0
    for length in lengths:
        upper_idx = lower_idx + length
        data = np.zeros((batch_size, *shape[1:]))
        data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
        lower_idx = upper_idx
        yield data
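
For illustration, a quick usage sketch with the dummy data from above (a batch_size of 5 is arbitrary; both generators modify idxs in place, hence the copies):

lengths, idxs, vals, shape = disassemble_data(my_data)

for image in image_generator(lengths, idxs.copy(), vals, shape):
    pass  # each image has shape (4, 32, 32)

for batch in image_batch_generator(lengths, idxs.copy(), vals, shape, batch_size=5):
    pass  # each batch has shape (5, 4, 32, 32)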

This is a rather convenient approach for my needs. Yet I was wondering if it is possible to speed this up.

For example, I saw that NumPy's itemset is faster than direct assignment (according to the docs), but it only works for a single item, not for index arrays.
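
For reference, a minimal sketch of that distinction; note that ndarray.itemset was removed in NumPy 2.0, and np.put is the closest counterpart that accepts index arrays (whether either beats plain fancy assignment would need measuring):

a = np.zeros((4, 32, 32), dtype=np.float32)

# itemset: one element at a time (flat index or multi-index); removed in NumPy 2.0
a.itemset(5, 1.0)
a.itemset((1, 2, 3), 2.0)

# for whole index arrays, fancy assignment (as above) or np.put are the options
flat_idx = np.array([10, 20, 30])
vals = np.array([1.5, 2.5, 3.5], dtype=np.float32)
a.ravel()[flat_idx] = vals   # what the code above does
np.put(a, flat_idx, vals)    # equivalent scatter into the flattened array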

Are there any other approaches? I'm not at all familiar with Cython and such, so I'd be happy to get some hints!

Obay
  • Did you just reinvent [scipy.sparse](https://docs.scipy.org/doc/scipy/reference/sparse.html)? The only difference is that it actually supports computations and can be converted to dense as needed. – Marat Apr 09 '18 at 02:11
  • It depends a bit on your use case. If you only want to use the sparse matrices as a "compression algorithm" but work only on dense matrices, using an HDF5 dataset with compression may also be a good and more performant way to do it. (This should get you above 600MB/s throughput.) https://stackoverflow.com/a/48997927/4045774 – max9111 Apr 09 '18 at 08:03
  • I know that there's scipy.sparse, but the formats presented there don't match the one I use. scipy.sparse is meant for matrices, while this method generalizes to tensors of any shape. Also a large difference with this method is the possibility to yield slices of the sparse tensor, so there's no need to inflate the whole tensor first. Is there a better way to achieve this with scipy.sparse? I used HDF5 with compression (lz4). While using only half the space, the decompression empirically takes longer, and occupies a full cpu-core. Even parallelizing over 8 cores was slower. – Obay Apr 09 '18 at 14:43
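
For reference, a rough sketch of the scipy.sparse route raised in the comment above: flattening each image into one row of a csr_matrix allows densifying single images or batches without inflating the whole tensor. Whether it matches the custom format in speed or size is not tested here.

import numpy as np
from scipy import sparse

# dummy data as in the question
dense = np.random.uniform(0, 1, (10, 4, 32, 32)).astype(np.float32)
dense[dense > 0.05] = 0.0

# one row per image, columns are the flattened (4, 32, 32) pixels
mat = sparse.csr_matrix(dense.reshape(dense.shape[0], -1))

# densify a single image on demand
image = mat[3].toarray().reshape(4, 32, 32)

# densify a batch of images
batch = mat[0:5].toarray().reshape(-1, 4, 32, 32)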

1 Answer


I tested a bit how this can be done more efficiently and came to the conclusion that your approach isn't bad at all for highly uncorrelated data like that produced by np.random.uniform. On real data this can look a bit different.

I improved the speed of your functions a bit, reaching about 1.4 GB/s compressing and 1.2 GB/s decompressing, which isn't bad at all. With h5py (blosclz) I could achieve only about 450 MB/s, but that also writes the data to disk.

Improved sparse algorithm

import numpy as np
import numba as nb
import time

#We can use uint16 on (4,32,32), since max. idx<2**16
@nb.jit()
def to_sparse_data_uint16(data):
  data_flat=data.reshape(-1)

  idx=np.empty(data.size,dtype=np.uint16)
  data_out=np.empty(data.size,dtype=data.dtype)

  ii=0
  for i in range(data_flat.shape[0]):
    if (data_flat[i]!=0):
      idx[ii]=i
      data_out[ii]=data_flat[i]
      ii+=1

  return idx[0:ii], data_out[0:ii], data.shape


def to_dense_data(idx,data,shape):
  length=np.prod(shape)
  data_out=np.zeros(length,dtype=data.dtype)

  data_out[idx]=data

  return data_out.reshape(shape)


########################
#do you really need float64 here?
images = np.array(np.random.uniform(0, 1, (100000, 4, 32, 32)),dtype=np.float32)
images[images > 0.05] = 0.

res=[]
t1=time.time()
for i in range(100000):
  res.append(to_sparse_data_uint16(images[i,:,:,:]))

print(time.time()-t1)

t1=time.time()
for i in range(100000):
  data=to_dense_data(res[i][0],res[i][1],res[i][2])

print(time.time()-t1)
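
As asked in the comments below, the decompression can be jitted as well, although it is largely memory-bound, so the gain may be small. A possible sketch (the helper names here are mine, not from the answer):

@nb.njit()
def _scatter(idx, vals, out_flat):
  # write each stored value back to its flat position
  for i in range(idx.shape[0]):
    out_flat[idx[i]] = vals[i]

def to_dense_data_nb(idx, vals, shape):
  out = np.zeros(shape, dtype=vals.dtype)
  _scatter(idx, vals, out.reshape(-1))  # reshape(-1) is a view into out
  return out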

HDF5 Example

import numpy as np
import tables #register blosc
import h5py as h5
import h5py_cache as h5c
import time

# I assume here that you don't need float64 for images...
# ~1650 MB of test data

images = np.array(np.random.uniform(0, 1, (100000, 4, 32, 32)),dtype=np.float32)
images[images > 0.05] = 0.

#Write data (32.7 GB uncompressed)
hdf5_path='Test.h5'
f = h5c.File(hdf5_path, 'w',chunk_cache_mem_size=1024**2*100) #100 MB chunk cache size
dset_images = f.create_dataset("images", shape=(20*100000, 4, 32, 32),dtype=np.float32,chunks=(1000, 4, 32, 32),compression=32001,compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)

t1=time.time()
#Don't call h5py too often, this will lead to bad performance
for i in range(20):
    dset_images[i*100000:(i+1)*100000,:,:,:]=images

f.close()
print(time.time()-t1)

print("MB/s: " + str(32700/(time.time()-t1)))
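
For completeness, reading the data back in chunk-aligned batches might look like the following sketch (it assumes the blosc filter is still registered at read time, e.g. via the import tables above):

f = h5.File(hdf5_path, 'r')
dset_images = f["images"]

t1=time.time()
#read back in chunk-aligned slices of 1000 images
for i in range(0, dset_images.shape[0], 1000):
    batch = dset_images[i:i+1000, :, :, :]
f.close()

print("MB/s: " + str(32700/(time.time()-t1)))
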
max9111
  • Interesting to see numba. Can this also be applied to decompression? The assumption of indices being small doesn't hold, since 2**16=65536 is sadly way too small for a several-million-image dataset. Good catch on the float32 though! – Obay Apr 09 '18 at 19:19
  • In the decompression example, I access the array only once (memory bound). Please note also that the function call overhead and the list appending hurt the performance quite a bit. The compression takes only 0.65s, the list appending another 0.4s. Regarding uint16: the number of images isn't a problem here, since the idx starts at zero for every image. As long as the images don't get bigger, this should be sufficient. – max9111 Apr 09 '18 at 21:30
  • I seem to miss something, doesn't the loop increment i up to data.size? So i=data.size-1 in the last iteration. Then i is assigned to idx[ii]. So this will result in an overflow (or truncation?) as far as I can tell! – Obay Apr 09 '18 at 21:52
  • Regarding image datatype: usual datatypes for images are uint8, and for some images uint16, where the sensor can't actually get beyond 10-14 bits per channel. 16 bits are usual because this is easier to handle, but this can also be optimized... Check what is really sufficient here. Please also make clear the access pattern and the possibly not-so-random nonzero positions and values (this may also put common lossless compression algorithms in front). Your images may also (or may not) be somewhat correlated with one another. There is no fastest/best compression algorithm without knowing the exact data. – max9111 Apr 09 '18 at 21:52
  • The images are taken by multiple photomultiplier arrays and need to be calibrated, so fractional pixel intensities (i.e. photoelectron counts) with high dynamic range remain. You are completely right about the correlation. However, I already tried using HDF5 compressed chunk storage with LZ4 (from the 'hdf5plugin' package); it was ~20x slower than the sparse format. I will give blosc and the chunk size setting a try though. – Obay Apr 09 '18 at 22:04
  • In your example each image is (4x32x32), so idx can't go beyond 4*32*32-1=4095 (and ii not beyond 4096). As said, uint16 is only suitable if np.prod(dims_of_single_image)<2**16. My apologies, you obviously don't have 3 channels... Using python-blosc to compress chunks of images directly is also a possibility and can get above 2-3GB/s on a quadcore processor and far above that on a powerful workstation. I can add an example tomorrow. https://github.com/Blosc/python-blosc – max9111 Apr 09 '18 at 22:07
  • Aaah, I see. I completely misread your code, sorry! Your code is meant to compress each image individually and store them in a list. My code takes all the images at once and stores them all in one big array. I then simply store lengths, idxs, vals, data.shape with numpy.savez in one single file and get hilariously high read performance. I will add a benchmark later. – Obay Apr 09 '18 at 22:14