
I have a large HDF5 file (50 GB). I need to extract a square submatrix from it. So far my code is:

import h5py
import random

file = h5py.File('numDistances.h5', 'r')
data = file['DS1']  # 120,000 x 120,000 matrix

# Pick 40,000 distinct indices and sort them:
randomRows = random.sample(range(110000), 40000)
randomRows.sort()

# Get the rows first and then the corresponding columns:
rows = data[randomRows, :]
output = rows[:, randomRows]

Unfortunately, pulling the data out like this is very slow. Do you know of any slicing techniques or additional libraries that could make this much faster? Thanks.

kPow989
  • I assume you've read the docs about this kind of indexing? With `randomRows` scattered throughout the large file, it will have to do a lot of file `seek`s. http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing. Using smaller selections that line up with chunks might help: http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage – hpaulj Dec 30 '17 at 18:21
  • Thanks. I have read the docs, but I've found even trying with a small number of rows (100) to be quite slow. – kPow989 Dec 30 '17 at 18:27
  • Another recent question asked about row shuffling. I suggested reading contiguous slices and doing the fancy indexing on the in-memory array (a rough sketch of that idea follows these comments). – hpaulj Dec 30 '17 at 18:40
  • https://stackoverflow.com/questions/47888392/is-it-possible-to-use-np-arrays-as-indices-in-h5py-datasets – hpaulj Dec 30 '17 at 18:51
  • Do you have a chunked or compressed dataset? In such cases you have to set a proper chunk cache size (see the second sketch below). Some examples: https://stackoverflow.com/a/44961222/4045774 https://stackoverflow.com/a/43580434/4045774 – max9111 Jan 02 '18 at 13:53
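
One way to read hpaulj's "read slices, fancy-index in memory" suggestion is sketched below. This is only an illustration, not code from the thread: BLOCK_ROWS is an invented tuning knob, it assumes each contiguous row block (and the final 40,000 x 40,000 result) fits in RAM, and it reproduces the same sorted-index selection the question uses.

import h5py
import random
import numpy as np

# Illustrative block size (not from the thread) -- tune it so that a
# BLOCK_ROWS x 120,000 slice fits comfortably in memory.
BLOCK_ROWS = 2000

with h5py.File('numDistances.h5', 'r') as f:
    data = f['DS1']  # 120,000 x 120,000 matrix
    idx = np.asarray(sorted(random.sample(range(110000), 40000)))

    pieces = []
    for start in range(0, data.shape[0], BLOCK_ROWS):
        stop = start + BLOCK_ROWS
        # Which of the wanted rows fall inside this contiguous block?
        wanted = idx[(idx >= start) & (idx < stop)]
        if wanted.size == 0:
            continue
        block = data[start:stop, :]                   # one large sequential read
        pieces.append(block[wanted - start][:, idx])  # fancy indexing in RAM
    output = np.concatenate(pieces, axis=0)           # 40,000 x 40,000 submatrix

Each pass reads a contiguous run of rows, which HDF5 serves far more efficiently than 40,000 scattered row reads, and the expensive fancy indexing then happens on a NumPy array in memory rather than on the dataset.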
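
max9111's chunk-cache point maps to the rdcc_* keyword arguments of h5py.File, available in h5py 2.9 and later. The numbers below are illustrative guesses, not values recommended in the thread, and they only matter if the dataset is actually chunked.

import h5py

# Open the file with a larger chunk cache (the defaults are 1 MiB and 521 slots).
f = h5py.File(
    'numDistances.h5', 'r',
    rdcc_nbytes=1024**3,    # 1 GiB chunk cache -- an illustrative guess
    rdcc_nslots=1_000_003,  # more hash slots than the default, to reduce cache collisions
)
data = f['DS1']
print(data.chunks)          # None means the dataset is contiguous, not chunked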

0 Answers