
I have a data set from electrophysiological recordings in an HDF5 file. As far as I understand, it is stored as something very close to NumPy arrays, and I am trying to access it as efficiently and quickly as possible.

Let me explain: the dataset is a list of arrays (a 2D array?); each array holds one value for each of x channels (recording sites), usually around 32-64.

The problem is the following: there are millions of arrays, and it's taking forever to loop through every individual array. Moreover, I have to loop through each channel in each array in order to retrieve the values.

Here is my code:

import h5py

f_kwd = h5py.File("experiment1_100.raw.kwd", "r") # reads hdf5 file
dset_data = f_kwd['recordings/0/data']
print (len(dset_data)) # prints 31646700
print (dset_data[0]) # prints the following

[    94   1377    208    202    246    387   1532   1003    460    665
810    638    223    363    990     78   -139    191     63    630
763     60    682   1025    472   1113   -137    360   1216    297
-71    -35   -477   -498   -541   -557  27776   2281 -11370  32767
-28849 -30243]

list_value = []
for t_stamp in (dset_data):
    for value in t_stamp:
        if value > 400:
            list_value.append(value)

Is there a way to make this a lot more efficient and quick? Do I have to use numpy and if so, how can I make this happen? I feel like I am doing something wrong here.

EDIT: Here is some additional info about the first array in the dataset, for the following attributes:

.shape -> (42,)
.itemsize -> 2
.dtype -> int16
.size -> 42
.ndim -> 1

EDIT2 : ..and the dataset itself:

.shape -> (31646700, 42)
.dtype -> int16
.size -> 1329161400

ukey
  • We need to know more about the dataset. In `h5py` a set may be 2d with 1 variable dimension, i.e. a ragged 2d array. But `numpy` 2d arrays have to be rectangular. A ragged set is loaded as a 1d numpy array with object dtype. Access to such an array is slower. In MATLAB, is that array being loaded as a `cell`? – hpaulj Apr 05 '17 at 21:22
  • http://stackoverflow.com/questions/42658438/storing-multidimensional-variable-length-array-with-h5py - A SO question about variable length array. – hpaulj Apr 05 '17 at 21:26
  • How about information for the dataset as a whole, not just one 'row'? – hpaulj Apr 05 '17 at 21:41
  • Do you have a chunked Dataset? If so, what is the chunk size of your Dataset? Do you really have a list of arrays of variable length? It looks like you are having one array with shape (31646700, 42)... – max9111 Apr 06 '17 at 14:49

2 Answers


If my guess is right that t_stamp is a 1d array of varying length, you could collect all elements >400 with:

list_value = []
for t_stamp in (dset_data):
    list_value.append(t_stamp[t_stamp>400])
    # list_value.extend(t_stamp[t_stamp>400])  # alternative: one flat list

Use append if you want to collect the values in sublists. Use extend if you want one flat list.

It still iterates on the 'rows' of dset_data, but selection from each row will be much faster.

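If all rows are 42 long, the whole dataset can be read into a 2d numpy array and masked in one step (dset_data.value did this in older h5py; it was removed in h5py 3.0, so slice the dataset instead):

data = dset_data[...]   # reads everything into memory, about 2.7 GB of int16 here
data[data > 400]

This will be a flat array of the selected values.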
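Reading 31646700 x 42 int16 values at once takes roughly 2.7 GB of memory, so a middle ground is to read blocks of rows and mask each block. The following is only a minimal sketch: the function name `values_above` and the `chunk_rows` parameter are made up for illustration, and it assumes the dataset is rectangular and supports ordinary row slicing, as an h5py dataset does. A small NumPy array stands in for the real file so the example runs on its own.

```python
import numpy as np

def values_above(dset, threshold=400, chunk_rows=1_000_000):
    """Collect values > threshold from a 2-D array-like, one block of rows at a time.

    `dset` can be an h5py dataset or a NumPy array (hypothetical helper).
    """
    parts = []
    for start in range(0, dset.shape[0], chunk_rows):
        block = dset[start:start + chunk_rows]   # with h5py, only these rows are read
        parts.append(block[block > threshold])   # boolean mask on an in-memory block
    if not parts:
        return np.empty(0, dtype=dset.dtype)
    return np.concatenate(parts)

# Stand-in for dset_data: 3 time stamps x 3 channels.
demo = np.array([[94, 1377, 208],
                 [401, -35, 400],
                 [32767, 399, 638]], dtype=np.int16)
result = values_above(demo, threshold=400, chunk_rows=2)
print(result)
```

With 42 int16 channels, a block of 1,000,000 rows is about 84 MB, so `chunk_rows` can be tuned to the memory available. On the real file the call would be `values_above(dset_data)`.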

hpaulj

This may help. First, flatten the nd array to a 1d array; second, sort it; third, find the index where 400 would be inserted. This way you do not need to iterate over all the items in a Python loop.

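import numpy as np

newData = dset_data[...].ravel()   # load into memory, then flatten to 1d
newData.sort()                     # in-place sort; original order is lost
index = np.searchsorted(newData, 400, side='right')   # first position with a value > 400
res = newData[index:]              # all values > 400, in ascending order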
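As a quick check of the sort-and-search idea, here is a small sketch on made-up data (the array below is hypothetical, not taken from the recording). It suggests that slicing the sorted array from the `side='right'` index returns the same values as a boolean mask, only in ascending order rather than recording order.

```python
import numpy as np

# Hypothetical stand-in: 3 time stamps x 4 channels, int16 like the real data.
data = np.array([[94, 1377, 208, 400],
                 [401, -35, 810, 63],
                 [32767, -477, 399, 638]], dtype=np.int16)

flat = np.sort(data.ravel())                       # sorted copy; data itself is untouched
first = np.searchsorted(flat, 400, side='right')   # index of the first value strictly > 400
above_sorted = flat[first:]

above_masked = data[data > 400]                    # same values, in original (row-major) order

print(above_sorted)
print(above_masked)
```

Sorting costs O(n log n) while the mask is a single O(n) pass, so the sorted route mainly pays off when several thresholds are queried against the same data.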
galaxyan