
I have 2 large arrays stored using h5py. I want to make some basic numpy operations like addition, subtraction, etc. What is the most elegant way to do so?

import h5py
import numpy as np

f = h5py.File('x', 'w')

d1 = f.create_dataset('1', (100000, 10000), 'i')
d2 = f.create_dataset('2', (100000, 10000), 'i')

d2[:] = 1

np.add(d2, d2, out=d1)

np.add() has a problem with the output argument because it is not an ArrayType. I assume I need to implement the addition myself, loading the file only block-wise so I don't "eat" all the memory, right? Something like this:

for block_index in range(d2.shape[0]):
    d1[block_index, :] = d2[block_index, :] + d2[block_index, :]

Or is there any nicer solution?

Thanks

  • `h5py` does not implement math itself. You have to load the datasets as numpy arrays and do the calculations with those. `h5py` is just an interface between the file system and `numpy`. http://docs.h5py.org/en/stable/high/dataset.html#reading-writing-data – hpaulj Sep 10 '19 at 18:13
  • Yes, you have to do it block-wise. To do this efficiently you need to think about the chunk size, chunk cache size (rdcc_nbytes), and number of slots (rdcc_nslots). https://stackoverflow.com/questions/48385256/optimal-hdf5-dataset-chunk-shape-for-reading-rows/48405220#48405220 In newer h5py versions these parameters are available directly in the main h5py API. http://docs.h5py.org/en/stable/high/file.html – max9111 Sep 11 '19 at 08:35
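The chunk-cache parameters mentioned in the comment above can be passed straight to h5py.File in recent h5py versions (2.9+). A minimal sketch, with a hypothetical file name and illustrative cache values:

```python
import h5py
import numpy as np

# rdcc_nbytes sets the raw-data chunk cache size, rdcc_nslots the number
# of hash slots; both are keyword arguments of h5py.File (h5py >= 2.9).
with h5py.File('cache_demo.h5', 'w',
               rdcc_nbytes=4 * 1024**2,   # 4 MiB chunk cache
               rdcc_nslots=10007) as f:   # a prime number is recommended
    # Chunking one full row per chunk suits row-wise reads/writes.
    d = f.create_dataset('x', (1000, 100), 'i', chunks=(1, 100))
    d[:] = 1
```

Whether these values help depends on your access pattern; the linked answer discusses how to size them.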

1 Answer


You are correct; d1 and d2 are h5py dataset objects, not numpy ndarrays. You need to add [:] to get an ndarray from the dataset.

I modified your example above to create datasets 1 and 2 (each filled with a constant value), then add them as ndarrays. The result is a new ndarray (d3_arr), which is then saved to a new dataset 3.

Example 1:

import h5py, numpy as np

with h5py.File('SO_57868593.h5', 'w') as h5f:

    d1 = h5f.create_dataset('1', (10, 10), 'i')
    d2 = h5f.create_dataset('2', (10, 10), 'i')

    d1[:] = 1
    d2[:] = 2
    d3_arr = np.ndarray( (10, 10), 'i' )
    np.add(d1[:], d2[:], out=d3_arr)    
    d3 = h5f.create_dataset('3', data=d3_arr)
    print ('done')

Example 2:
In this method, dataset 3 is created first (empty/no data), and d3_arr is a ndarray with a single row. The for loop iterates over the rows of d1 and d2, adding them to get d3_arr, then copying d3_arr into the matching row in dataset 3.

# replace/reorder code below as shown
#   d3_arr = np.ndarray( (10, 10), 'i' )
#   np.add(d1[:], d2[:], out=d3_arr)
#   d3 = h5f.create_dataset('3', data=d3_arr)

    d3 = h5f.create_dataset('3', (10, 10), 'i')
    d3_arr = np.ndarray( (1, 10), 'i' )
    for row in range(d1.shape[0]):
        np.add(d1[row,:], d2[row,:], out=d3_arr)
        d3[row,:] = d3_arr
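
If single rows make the loop too slow, the same idea works with multi-row blocks: read a slab of rows from each dataset (slicing a dataset returns an ndarray), add them, and write the result back. A sketch, assuming a hypothetical file name and a block size you would tune to your memory and chunk layout:

```python
import h5py
import numpy as np

block = 4  # rows per block; tune to available memory / chunk size
with h5py.File('SO_57868593_blocks.h5', 'w') as h5f:
    d1 = h5f.create_dataset('1', (10, 10), 'i')
    d2 = h5f.create_dataset('2', (10, 10), 'i')
    d1[:] = 1
    d2[:] = 2
    d3 = h5f.create_dataset('3', (10, 10), 'i')
    for start in range(0, d1.shape[0], block):
        stop = min(start + block, d1.shape[0])
        # d1[start:stop] and d2[start:stop] are ndarrays, so numpy
        # addition works; the result is written back slab by slab.
        d3[start:stop] = d1[start:stop] + d2[start:stop]
```

Only one block of each dataset is in memory at a time, so peak memory is proportional to the block size rather than the full array.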
kcw78
  • If I understand it right, d1 and d2 therefore need to be loaded into RAM the moment I use [:]? I would probably switch to using numpy.memmap. – Miroslav Karpíšek Sep 11 '19 at 07:16
  • d1 and d2 are h5py objects (datasets). The entire dataset is not loaded into memory. However, d1[:] and d2[:] are numpy arrays (ndarrays in this example). If memory is an issue, you could perform the addition on a row-by-row basis. I think this is a simpler process than reading HDF5, creating a memmap, doing your addition, then writing back to HDF5. See second example added to my answer. – kcw78 Sep 11 '19 at 14:20