
I have a large HDF5 file (50 GB). I need to extract a square submatrix from it. So far my code is:

import h5py
import random

file = h5py.File('numDistances.h5', 'r')
data = file['DS1']  # 120,000 x 120,000 matrix

# Pick 40,000 distinct indices and sort them:
randomRows = random.sample(range(110000), 40000)
randomRows.sort()

# Get the rows first and then the corresponding columns:
rows = data[randomRows, :]
output = rows[:, randomRows]

Unfortunately, pulling the data out like this is very slow. Do you know of any slicing techniques or additional libraries that could make this much faster? Thanks.

kPow989
  • I assume you've read the docs about this kind of indexing? With `randomRows` scattered throughout the large file, it will have to do a lot of file `seek`s. http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing. Using smaller selections that line up with chunks might help: http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage – hpaulj Dec 30 '17 at 18:21
  • Thanks. I have read the docs, but I've found even trying with a small number of rows (100) to be quite slow. – kPow989 Dec 30 '17 at 18:27
  • Another recent question asked about row shuffling. I suggested reading contiguous slices and doing the fancy indexing on the in-memory array (a rough sketch of that idea follows these comments). – hpaulj Dec 30 '17 at 18:40
  • https://stackoverflow.com/questions/47888392/is-it-possible-to-use-np-arrays-as-indices-in-h5py-datasets – hpaulj Dec 30 '17 at 18:51
  • Do you have a chunked or compressed dataset? In such cases you have to set a proper chunk cache size (see the second sketch below). Some examples: https://stackoverflow.com/a/44961222/4045774 https://stackoverflow.com/a/43580434/4045774 – max9111 Jan 02 '18 at 13:53
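
One way to read hpaulj's "read slices, fancy-index in memory" suggestion is sketched below. This is only an illustration, not code from the thread: BLOCK_ROWS is an invented tuning knob, it assumes each contiguous row block (and the final 40,000 x 40,000 result) fits in RAM, and it reproduces the same sorted-index selection the question uses.

import h5py
import random
import numpy as np

# Illustrative block size (not from the thread) -- tune it so that a
# BLOCK_ROWS x 120,000 slice fits comfortably in memory.
BLOCK_ROWS = 2000

with h5py.File('numDistances.h5', 'r') as f:
    data = f['DS1']  # 120,000 x 120,000 matrix
    idx = np.asarray(sorted(random.sample(range(110000), 40000)))

    pieces = []
    for start in range(0, data.shape[0], BLOCK_ROWS):
        stop = start + BLOCK_ROWS
        # Which of the wanted rows fall inside this contiguous block?
        wanted = idx[(idx >= start) & (idx < stop)]
        if wanted.size == 0:
            continue
        block = data[start:stop, :]                   # one large sequential read
        pieces.append(block[wanted - start][:, idx])  # fancy indexing in RAM
    output = np.concatenate(pieces, axis=0)           # 40,000 x 40,000 submatrix

Each pass reads a contiguous run of rows, which HDF5 serves far more efficiently than 40,000 scattered row reads, and the expensive fancy indexing then happens on a NumPy array in memory rather than on the dataset.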
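
max9111's chunk-cache point maps to the rdcc_* keyword arguments of h5py.File, available in h5py 2.9 and later. The numbers below are illustrative guesses, not values recommended in the thread, and they only matter if the dataset is actually chunked.

import h5py

# Open the file with a larger chunk cache (the defaults are 1 MiB and 521 slots).
f = h5py.File(
    'numDistances.h5', 'r',
    rdcc_nbytes=1024**3,    # 1 GiB chunk cache -- an illustrative guess
    rdcc_nslots=1_000_003,  # more hash slots than the default, to reduce cache collisions
)
data = f['DS1']
print(data.chunks)          # None means the dataset is contiguous, not chunked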

0 Answers