
I am currently writing columns into an h5py dataset, having defined the dataset as:

import h5py
import numpy as np
f = h5py.File(batch_path,'w')
data = f.create_dataset('data_set',(525600,1300),dtype=np.float32)

and adding arrays to it as:

import pandas as pd
for index,file in enumerate(files):
    df = pd.read_csv(file)
    result = np.array(list(map(lambda x: float(x.split(';')[1]),df.as_matrix()[:,0])))
    data[:,index] = result[:]

However, the last step (data[:,index] = result[:]) takes an incredible amount of time. What is wrong here?
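
For reference, here is a minimal self-contained sketch of the same pattern, with random data standing in for the parsed CSV columns; the (525600,1300) float32 shape and the contiguous (non-chunked) dataset are taken from the question, while the file name, column count and timing are purely illustrative:

import time
import h5py
import numpy as np

n_rows, n_cols = 525600, 1300

with h5py.File('batch_demo.h5', 'w') as f:
    # Contiguous dataset, created exactly as in the question
    data = f.create_dataset('data_set', (n_rows, n_cols), dtype=np.float32)

    # Write a handful of columns and time each assignment
    for index in range(5):
        result = np.random.rand(n_rows).astype(np.float32)
        start = time.time()
        data[:, index] = result
        print('column {}: {:.2f} s'.format(index, time.time() - start))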

Erik
    I have seen inappropriate chunking massively affect read & write performance. If you know what kind of out-of-memory queries you will be performing on your HDF5 dataset, you can select an appropriate chunk size. See [h5py chunking docs](http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage) for details. – jpp Mar 09 '18 at 10:25
  • Any difference if you save to a (1300,525600) set, iterating on the 1st dimension? – hpaulj Mar 09 '18 at 15:45
  • 1) Use chunking. For example https://stackoverflow.com/a/48405220/4045774 2) Also take a look at "The simplest form of fancy slicing". This will also have a noticeable impact on performance. – max9111 Mar 20 '18 at 12:28
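
Following the chunking and layout suggestions in the comments above, here is a rough sketch of what a column-aligned chunk shape and the transposed alternative could look like; the chunk shape, file names and random data are illustrative assumptions, not taken from the question:

import h5py
import numpy as np

n_rows, n_cols = 525600, 1300

# Option 1: keep the (rows, columns) orientation, but align the chunk
# shape with the access pattern so each column assignment touches a
# single chunk instead of striding across the whole dataset.
with h5py.File('batch_chunked.h5', 'w') as f:
    data = f.create_dataset('data_set', (n_rows, n_cols),
                            dtype=np.float32, chunks=(n_rows, 1))
    for index in range(n_cols):
        result = np.random.rand(n_rows).astype(np.float32)
        data[:, index] = result

# Option 2: store the transpose, as hpaulj suggests, and write whole
# rows, which are contiguous in HDF5's C-order layout.
with h5py.File('batch_transposed.h5', 'w') as f:
    data = f.create_dataset('data_set', (n_cols, n_rows), dtype=np.float32)
    for index in range(n_cols):
        result = np.random.rand(n_rows).astype(np.float32)
        data[index, :] = result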

0 Answers