
I have one big numpy array A of shape (2_000_000, 2000) of dtype float64, which takes 32 GB.

(or alternatively the same data split into 10 arrays of shape (200_000, 2000); that may be easier for serialization?)

How can we serialize it to disk such that we can have fast random read access to any part of the data?

More precisely I need to be able to read tens of thousands of windows of shape (16, 2000) from A, each starting at a random index i:

L = []
for _ in range(10_000):
    i = random.randint(0, 2_000_000 - 16)
    window = A[i:i+16, :]         # window of A of shape (16, 2000) starting at a random index i
    L.append(window)
WINS = np.stack(L)   # shape (10_000, 16, 2000) of float64, i.e. ~ 2.4 GB

Let's say I only have 8 GB of RAM available for this task; it's totally impossible to load the whole 32 GB of A in RAM.

How can we read such windows from a numpy array serialized on disk (.h5 format or any other)?

Note: the fact that the reading is done at randomized starting indexes is important.

    Does this answer your question? https://stackoverflow.com/questions/29209293/how-to-store-an-array-in-hdf5-file-which-is-too-big-to-load-in-memory – orlp Jul 01 '21 at 08:43
  • Thank you @orlp, it is related, but it does not answer it, since the specific part here is that I need to be able to take windows / slices of the original big array at random starting points. These details make it more specific. – Basj Jul 01 '21 at 09:01
  • You could try NumPy memory maps: `numpy.memmap(filename, dtype=, mode='r+', offset=0, shape=None, order='C')` – Ryan Pepper Jul 01 '21 at 10:39
  • @RyanPepper I tried, but here is the problem I have: https://stackoverflow.com/questions/68209831/resize-in-place-a-numpy-memmap – Basj Jul 01 '21 at 12:05
  • @orlp's suggestion to write your data to an HDF5 file is a good solution. Once the array is in the file, it's easy to read slices from the dataset. The h5py package supports NumPy indexing (including some fancy indexing techniques). – kcw78 Jul 01 '21 at 19:15

2 Answers


This example shows how you can use an HDF5 file for the process you describe.

First, create an HDF5 file with a dataset of shape (2_000_000, 2000) and dtype=float64. I used variables for the dimensions so you can tinker with them.

import numpy as np
import h5py
import random

h5_a0, h5_a1 = 2_000_000, 2_000

with h5py.File('SO_68206763.h5','w') as h5f:
    dset = h5f.create_dataset('test',shape=(h5_a0, h5_a1),dtype='float64')
    
    # write the array in 1_000 blocks of 2_000 rows so only one ~32 MB block is in RAM at a time
    incr = 1_000
    a0 = h5_a0//incr
    for i in range(incr):
        arr = np.random.random(a0*h5_a1).reshape(a0,h5_a1)
        dset[i*a0:i*a0+a0, :] = arr       
    print(dset[-1,0:10])  # quick dataset check of values in last row

Next, open the file in read mode, read 10_000 random array slices of shape (16, 2_000), and append them to the list L. At the end, convert the list to the array WINS. Note that by default the concatenated array will have 2 axes -- you need to use .reshape() if you want 3 axes per your comment (the reshape is also shown).

with h5py.File('SO_68206763.h5','r') as h5f:
    dset = h5f['test']
    L = []
    ds0, ds1 = dset.shape[0], dset.shape[1]
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        window = dset[ir:ir+16, :]  # window from dset of shape (16, 2000) starting at random index ir
        L.append(window)
    WINS = np.concatenate(L)   # shape (160_000, 2_000) of float64,
    print(WINS.shape, WINS.dtype)
    WINS = np.concatenate(L).reshape(10_000,16,ds1)   # reshaped to (10_000, 16, 2_000) of float64
    print(WINS.shape, WINS.dtype)

The procedure above is not memory efficient. You wind up with 2 copies of the randomly sliced data: in both the list L and the array WINS. If memory is limited, this could be a problem. To avoid the intermediate copy, read each random slice of data directly into a preallocated array. Doing this simplifies the code and reduces the memory footprint. That method is shown below (WINS2 is a 2 axis array, and WINS3 is a 3 axis array).

with h5py.File('SO_68206763.h5','r') as h5f:
    dset = h5f['test']
    ds0, ds1 = dset.shape[0], dset.shape[1]
    WINS2 = np.empty((10_000*16,ds1))    # 2 axis result
    WINS3 = np.empty((10_000,16,ds1))    # 3 axis result
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        window = dset[ir:ir+16, :]       # read each random slice from disk once
        WINS2[i*16:(i+1)*16,:] = window
        WINS3[i,:,:] = window
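
If the random reads turn out to be I/O bound, it may also help to create the dataset with a chunked layout aligned to the window height, so each 16-row read touches at most two chunks on disk. A minimal sketch, assuming a chunk shape of (16, 2_000) and a hypothetical file name (tune both to your data and access pattern):

import numpy as np
import h5py

h5_a0, h5_a1 = 2_000_000, 2_000

with h5py.File('SO_68206763_chunked.h5','w') as h5f:
    # chunks=(16, h5_a1) is an assumption matched to the (16, 2000) window shape
    dset = h5f.create_dataset('test', shape=(h5_a0, h5_a1), dtype='float64',
                              chunks=(16, h5_a1))
    incr = 1_000
    a0 = h5_a0//incr
    for i in range(incr):
        # fill in row blocks, as in the code above
        dset[i*a0:(i+1)*a0, :] = np.random.random((a0, h5_a1))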
kcw78

An alternative solution to h5py datasets that I tried and that works is using np.memmap, as suggested in @RyanPepper's comment.

Write the data as binary

import numpy as np

# toy example: 1_000 blocks of 10 rows -> a (10_000, 2000) float32 array on disk
with open('a.bin', 'wb') as A:
    for f in range(1000):
        x = np.random.randn(10*2000).astype('float32').reshape(10, 2000)
        A.write(x.tobytes())
        A.flush()

Open later as memmap

A = np.memmap('a.bin', dtype='float32', mode='r').reshape((-1, 2000))
print(A.shape)  # (10000, 2000)
print(A[1234:1234+16, :])  # window
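
To mirror the loop from the question, the random windows can then be gathered straight from the memmap into a preallocated array. A minimal sketch against the toy (10_000, 2000) float32 file written above (the window count and shapes just reuse the question's numbers):

import numpy as np
import random

A = np.memmap('a.bin', dtype='float32', mode='r').reshape((-1, 2000))
n_rows = A.shape[0]

WINS = np.empty((10_000, 16, 2000), dtype=A.dtype)   # preallocated result, ~1.2 GB for float32
for k in range(10_000):
    i = random.randint(0, n_rows - 16)
    WINS[k] = A[i:i+16, :]   # slicing the memmap is lazy; this assignment copies only the window into RAM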
Basj