This example shows how you can use an HDF5 file for the process you describe.
First, create a HDF5 file with a dataset of shape(2_000_000, 2000)
and dtype=float64
values. I used variables for the dimensions so you can tinker with it.
import numpy as np
import h5py
import random
h5_a0, h5_a1 = 2_000_000, 2_000
with h5py.File('SO_68206763.h5','w') as h5f:
dset = h5f.create_dataset('test',shape=(h5_a0, h5_a1))
incr = 1_000
a0 = h5_a0//incr
for i in range(incr):
arr = np.random.random(a0*h5_a1).reshape(a0,h5_a1)
dset[i*a0:i*a0+a0, :] = arr
print(dset[-1,0:10]) # quick dataset check of values in last row
Next, open the file in read mode, read 10_000 random array slices of shape (16,2_000)
and append to the list L
. At the end, convert the list to the array WINS
. Note, by default the array will have 2 axes -- you need to use .reshape()
if you want 3 axes per your comment (reshape also shown).
with h5py.File('SO_68206763.h5','r') as h5f:
dset = h5f['test']
L = []
ds0, ds1 = dset.shape[0], dset.shape[1]
for i in range(10_000):
ir = random.randint(0, ds0 - 16)
window = dset[ir:ir+16, :] # window from dset of shape (16, 2000) starting at a random index i
L.append(window)
WINS = np.concatenate(L) # shape (160_000, 2_000) of float64,
print(WINS.shape, WINS.dtype)
WINS = np.concatenate(L).reshape(10_0000,16,ds1) # reshaped to (10_000, 16, 2_000) of float64
print(WINS.shape, WINS.dtype)
The procedure above is not memory efficient. You wind up with 2 copies of the randomly sliced data: in both list L and array WINS. If memory is limited, this could be a problem. To avoid the intermediate copy, read the random slide of data directly to an array. Doing this simplifies the code, and reduces the memory footprint. That method is shown below (WINS2 is a 2 axis array, and WINS3 is a 3 axis array).
with h5py.File('SO_68206763.h5','r') as h5f:
dset = h5f['test']
ds0, ds1 = dset.shape[0], dset.shape[1]
WINS2 = np.empty((10_000*16,ds1))
WINS3 = np.empty((10_000,16,ds1))
for i in range(10_000):
ir = random.randint(0, ds0 - 16)
WINS2[i*16:(i+1)*16,:] = dset[ir:ir+16, :]
WINS3[i,:,:] = dset[ir:ir+16, :]