I have a stack of 4-dimensional numpy arrays saved as .npy files. Each one is about 1.5 GB and I have 240 files, so roughly 360 GB in total, far more than fits in memory. I want to combine them into a single Zarr array in a Google Cloud Storage bucket.
My first attempt was to initialize a zarr array that is empty along the first dimension, as follows:

import gcsfs
import numpy as np
import zarr

# zarr array in the GCS bucket, empty along the first axis so data can be appended
z = zarr.open(
    gcsfs.GCSFileSystem(project=<project-name>).get_mapper(<bucket-name>),
    mode='w',
    shape=(0, 256, 1440, 3),
    dtype=np.float32,
)
then read each 1.5 GB file and append it to the array:

for fn in filenames:
    # load one file into memory and append it along the first axis
    z.append(np.load(fn))
and this seems to work, but it is extremely slow; at the rate it was going, it looked like it would take multiple days.
I am doing this from a virtual machine on Google Cloud Platform, so my personal network speed shouldn't be an issue.
Is there an efficient and workable way to accomplish this task? Maybe using dask with an intermediate step (rough sketch of what I mean below)? Any suggestions appreciated.
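To be concrete, this is roughly the dask route I have in mind. It is only an untested sketch: the chunk size along the first axis is a guess, the overwrite flag is just to clear leftovers from the earlier attempt, and I'm assuming np.load with mmap_mode='r' keeps each file on disk until dask actually writes its chunks.

import dask.array as da
import gcsfs
import numpy as np

store = gcsfs.GCSFileSystem(project=<project-name>).get_mapper(<bucket-name>)

# Memory-map each .npy file so nothing is read into RAM up front.
lazy = [da.from_array(np.load(fn, mmap_mode='r')) for fn in filenames]

# Concatenate along the first axis, rechunk to uniform chunks
# (dask's to_zarr requires regular chunking), then write to the bucket
# in parallel. The 32 along the first axis is just a guess at a
# reasonable chunk size (~140 MB per chunk at float32).
stacked = da.concatenate(lazy, axis=0).rechunk((32, 256, 1440, 3))
stacked.to_zarr(store, overwrite=True)  # replace anything from the earlier attempt

If that is on the right track, I'd still like to know whether the rechunk step or the memory-mapping causes problems at this scale.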