I have a dataset from electrophysiological recordings stored in an HDF5 file. As far as I understand, its contents are very close to NumPy arrays, and what I am trying to do is access them in the most efficient and fast way possible.
Let me explain: the dataset is a list of arrays (a 2D array?); each array contains some number of channels (recording sites), usually around 32-64.
The problem is the following: there are millions of arrays, and it takes forever to loop through every individual array. Moreover, I have to loop through each channel in each array in order to retrieve the values.
Here is my code:
import h5py
f_kwd = h5py.File("experiment1_100.raw.kwd", "r") # reads hdf5 file
dset_data = f_kwd['recordings/0/data']
print (len(dset_data)) # prints 31646700
print (dset_data[0]) # prints the following
[ 94 1377 208 202 246 387 1532 1003 460 665
810 638 223 363 990 78 -139 191 63 630
763 60 682 1025 472 1113 -137 360 1216 297
-71 -35 -477 -498 -541 -557 27776 2281 -11370 32767
-28849 -30243]
list_value = []
for t_stamp in dset_data:
    for value in t_stamp:
        if value > 400:
            list_value.append(value)
Is there a way to make this a lot faster and more efficient? Do I have to use NumPy, and if so, how can I make this happen? I feel like I am doing something wrong here.
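For context, here is a minimal sketch of the kind of chunked, vectorized approach I have been reading about. I am using a small random array as a stand-in for the real HDF5 dataset, and `chunk_size` is just a guess that would need tuning to available memory:

```python
import numpy as np

# Stand-in for the HDF5 dataset; in the real code this would be
# dset_data = h5py.File("experiment1_100.raw.kwd", "r")['recordings/0/data']
dset_data = np.random.randint(-2000, 2000, size=(100000, 42), dtype=np.int16)

chunk_size = 10000  # rows per bulk read; tune to memory (assumption)
parts = []
for start in range(0, dset_data.shape[0], chunk_size):
    chunk = dset_data[start:start + chunk_size]  # one bulk read per slice
    parts.append(chunk[chunk > 400])             # vectorized boolean mask
list_value = np.concatenate(parts)               # 1D array of all values > 400
```

With h5py, slicing like `dset_data[start:start + chunk_size]` should pull a whole block into memory at once, so the per-value Python loop disappears; the boolean mask then does the comparison in C instead of interpreted Python.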
EDIT: Here is some additional info about the first array in the dataset, for the following attributes:
.shape -> (42,)
.itemsize -> 2
.dtype -> int16
.size -> 42
.ndim -> 1
EDIT2: ...and the dataset itself:
.shape -> (31646700, 42)
.dtype -> int16
.size -> 1329161400