
How do you store sparse NDArrays on disk in Python?

I am answering my own question because I wasted almost a week trying to get sparse, out-of-core matrices working. Perhaps this is obvious to some, but not to me, and perhaps to another poor soul!

1 Answer


Following a hint from the accepted answer here, I tested chunked datasets created with h5py. The following step-by-step test worked.

>>> import h5py
>>> f = h5py.File('./test.h5', 'w')  # explicit mode; newer h5py no longer defaults to append
>>> d = f.create_dataset('test', (10000, 10000), chunks=(100, 100))
>>> f.flush()
>>> d[1, 1] = 1.0
>>> f.flush()
>>> d[2, 1] = 1.0
>>> f.flush()
>>> d[2, 100] = 1.0
>>> f.flush()
>>> d[2000, 100] = 1.0
>>> f.flush()
>>> d[2000, 1000] = 1.0
>>> f.flush()

Below are the file sizes reported by ls after each flush:

$ ls -lth test.h5
-rw-rw-r-- 1 aidan aidan 1.4K Jul 28 18:51 test.h5
$ ls -lth test.h5
-rw-rw-r-- 1 aidan aidan 43K Jul 28 18:51 test.h5
$ ls -lth test.h5
-rw-rw-r-- 1 aidan aidan 43K Jul 28 18:52 test.h5
$ ls -lth test.h5
-rw-rw-r-- 1 aidan aidan 83K Jul 28 18:52 test.h5
$ ls -lth test.h5
-rw-rw-r-- 1 aidan aidan 122K Jul 28 18:52 test.h5
$ ls -lth test.h5
-rw-rw-r-- 1 aidan aidan 161K Jul 28 18:53 test.h5
$ 

It can be seen that the file only grows in increments of roughly 40 KB (one 100x100 chunk of float32 values, the default dtype: 100 * 100 * 4 bytes), and only when an element is written outside the existing chunks. Note that the second write, d[2, 1], lands in the same chunk as d[1, 1], so the file stays at 43K. We can also skip around, and only the chunks that are actually touched get allocated (i.e. no intermediate chunks)!
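You can also check this without ls by asking HDF5 how many bytes it has actually allocated for the dataset's chunks, via h5py's low-level get_storage_size. A minimal sketch (test2.h5 is just a scratch file name; the byte counts shown are what I'd expect for the default float32 dtype):

>>> import h5py
>>> f = h5py.File('./test2.h5', 'w')
>>> d = f.create_dataset('test', (10000, 10000), chunks=(100, 100))
>>> d[1, 1] = 1.0
>>> f.flush()
>>> d.id.get_storage_size()  # bytes allocated: one 100x100 float32 chunk
40000
>>> d[2000, 1000] = 1.0
>>> f.flush()
>>> d.id.get_storage_size()  # a second chunk is allocated, nothing in between
80000

The difference between this number and the file size on disk is HDF5 metadata overhead.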

Magic!
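For completeness, reading is out-of-core as well: slicing a chunked dataset only reads the chunks that overlap the requested window, not the whole 10000x10000 array. A minimal sketch against the test file from above:

>>> import h5py
>>> f = h5py.File('./test.h5', 'r')
>>> d = f['test']
>>> block = d[1900:2100, 900:1100]  # only the chunks overlapping this window are read
>>> block.shape
(200, 200)
>>> float(block[100, 100])  # this is d[2000, 1000], which was set to 1.0 above
1.0
>>> f.close()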
