
I currently have a dataset with a million rows, each with around 10,000 columns (variable length).

Now I want to write this data to an HDF5 file so I can use it later on. I got this to work, but it's incredibly slow. Even 1000 values take up to a few minutes just to get stored in the HDF5 file.

I've been looking everywhere, including SO and the h5py docs, but I really can't find anything that describes my use case, yet I know it can be done.

Below is demo code showing what I'm doing right now:

import h5py
import numpy as np

# I am using just random values here
# I know I can use h5py broadcasts and I have seen it being used before.
# But the issue I have is that I need to save around a million rows with each 10000 values
# so I can't keep the entire array in memory.
random_ints = np.random.random(size = (5000,10000))

# See http://stackoverflow.com/a/36902906/3991199 for "libver='latest'"
with h5py.File('my.data.hdf5', "w", libver='latest') as f:
    X = f.create_dataset("X", (5000,10000))
    for i1 in range(0, 5000):
        for i2 in range(0, 10000):
            X[i1,i2] = random_ints[i1,i2]

        if i1 != 0 and i1 % 1000 == 0:
            print("Done %d values..." % i1)

The real data comes from a database; it is not a pre-generated NumPy array as in the code above.

If you run this code you can see it takes a long time before it prints out "Done 1000 values".

I'm on a laptop with 8 GB RAM, Ubuntu 16.04 LTS, an Intel Core M (which performs similarly to a Core i5), and an SSD; that should be enough to perform a bit faster than this.

I've read about broadcasting here: http://docs.h5py.org/en/latest/high/dataset.html

When I use it like this:

for i1 in range(0, 5000):
    X[i1,:] = random_ints[i1]

This is already an order of magnitude faster (done in a few seconds). But I don't know how to get that to work with a variable-length dataset (the columns are variable-length). I'd appreciate some insight into how this should be done, as I don't think I have a good grasp of the HDF5 concepts yet :) Thanks a lot!

marc_s
Peter Willemsen
  • Yes, iterating and writing individual numbers to the file (or even to an in-memory numpy array) is slow. For speed you want to work with larger chunks, thousands of numbers. – hpaulj Oct 16 '16 at 16:08
  • @hpaulj Thanks for the heads-up. Could you elaborate on that? How can I deal with the variable length? My instinct tells me to just pad the columns to their largest counterparts, and then use the second code block in my question to insert the numbers. Is that a good way to tackle this issue? – Peter Willemsen Oct 16 '16 at 17:41
  • I don't see anything in your demo code that uses variable-lengths. All you are doing is writing an array to the file either by element or by row. – hpaulj Oct 16 '16 at 18:02
  • @hpaulj that is just a demo to describe the problem, the real source has variable length. I'm in the process of trying it out with padded columns, to see if that goes any faster. I think it will! – Peter Willemsen Oct 16 '16 at 18:17
  • I've answered a few questions about `vlen`, http://stackoverflow.com/questions/30543791/inexplicable-behavior-when-using-vlen-with-h5py/30549199#30549199, but I don't think there's much expertise on the subject here. – hpaulj Oct 16 '16 at 18:23

1 Answer


Following http://docs.h5py.org/en/latest/special.html

and using an open h5 file f, I tried:

dt = h5py.special_dtype(vlen=np.dtype('int32'))
vset=f.create_dataset('vset', (100,), dtype=dt)

Setting the elements one by one:

vset[0]=np.random.randint(0,100,1000)    # set just one element
for i in range(100):    # set all arrays of varying length
    vset[i]=np.random.randint(0,100,i)
vset[:]      # view the dataset

Or making an object array:

D=np.empty((100,),dtype=object)
for i in range(100):   # setting that in same way
    D[i]=np.random.randint(0,100,i)

vset[:]=D    # write it to the file

vset[:]=D[::-1]   # or write it in reverse order

A portion of the last write:

In [587]: vset[-10:]
Out[587]: 
array([array([52, 52, 46, 80,  5, 89,  6, 63, 21]),
       array([38, 95, 51, 35, 66, 44, 29, 26]),
       array([51, 96,  3, 64, 55, 31, 18]),
       array([85, 96, 30, 82, 33, 45]), array([28, 37, 61, 57, 88]),
       array([76, 65,  5, 29]), array([78, 29, 72]), array([77, 32]),
       array([5]), array([], dtype=int32)], dtype=object)
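The object-array write above can be extended to data that does not fit in memory by filling a small object array and writing it one slice at a time. The following is only a sketch: `fake_rows` is a hypothetical stand-in for the database cursor, `write_vlen_batched` and `batch_size` are names made up for illustration, and it has only been reasoned through for batches whose rows differ in length.

```python
import h5py
import numpy as np


def fake_rows(n_rows, max_len, seed=0):
    """Hypothetical stand-in for a database cursor: yields int32 rows of varying length."""
    rng = np.random.default_rng(seed)
    for i in range(n_rows):
        length = 1 + (i % max_len)  # lengths cycle through 1..max_len
        yield rng.integers(0, 100, size=length, dtype=np.int32)


def write_vlen_batched(path, rows, n_rows, batch_size=1000):
    """Write variable-length rows to a vlen dataset, one object-array batch per write."""
    dt = h5py.special_dtype(vlen=np.dtype('int32'))
    with h5py.File(path, 'w') as f:
        vset = f.create_dataset('vset', (n_rows,), dtype=dt)
        batch = np.empty((batch_size,), dtype=object)
        start = 0   # dataset index of the first row held in the batch
        filled = 0  # number of rows currently held in the batch
        for row in rows:
            batch[filled] = row
            filled += 1
            if filled == batch_size:
                vset[start:start + filled] = batch  # one write per full batch
                start += filled
                filled = 0
        if filled:  # flush the final, partially filled batch
            vset[start:start + filled] = batch[:filled]
            start += filled
    return start  # total number of rows written
```

Called as `write_vlen_batched('my.data.hdf5', fake_rows(5000, 10000), n_rows=5000)`, only one batch of rows is held in memory at a time.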

I can view portions of an element with:

In [593]: vset[3][:10]
Out[593]: array([86, 26,  2, 79, 90, 67, 66,  5, 63, 68])

but I can't treat it as a 2d array: vset[3,:10]. It's an array of arrays.
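If 2d slicing matters, the padding idea raised in the comments is an alternative: store the rows in a fixed-width dataset and keep the true lengths in a second dataset. A minimal sketch, assuming the maximum row length is known up front; `write_padded`, `read_row`, the `lengths` dataset and the fill value `-1` are all illustrative choices, not part of h5py.

```python
import h5py
import numpy as np


def write_padded(path, rows, n_rows, max_len, batch_size=1000, fill=-1):
    """Store ragged int32 rows in a fixed-width 2d dataset plus a 'lengths' dataset."""
    with h5py.File(path, 'w') as f:
        X = f.create_dataset('X', (n_rows, max_len), dtype='int32', fillvalue=fill)
        lengths = f.create_dataset('lengths', (n_rows,), dtype='int64')
        buf = np.full((batch_size, max_len), fill, dtype='int32')
        buf_len = np.zeros((batch_size,), dtype='int64')
        start = 0   # dataset index of the first row held in the buffer
        filled = 0  # number of rows currently held in the buffer
        for row in rows:
            buf[filled, :len(row)] = row  # the rest of the buffer row stays padded
            buf_len[filled] = len(row)
            filled += 1
            if filled == batch_size:
                X[start:start + filled] = buf
                lengths[start:start + filled] = buf_len
                start += filled
                filled = 0
                buf[:] = fill  # reset so stale values never leak into later rows
        if filled:  # flush the final, partially filled buffer
            X[start:start + filled] = buf[:filled]
            lengths[start:start + filled] = buf_len[:filled]


def read_row(f, i, stop=None):
    """Return the real values of row i (at most `stop` of them), without padding."""
    n = int(f['lengths'][i])
    if stop is not None:
        n = min(n, stop)
    return f['X'][i, :n]
```

With this layout `f['X'][3, :10]` works as ordinary 2d indexing. The cost is the disk space taken by padding on short rows; passing `compression='gzip'` to `create_dataset` may reduce that, at some cost in write speed.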

hpaulj