
I have a stack of 4-dimensional numpy arrays saved as .npy files. Each one is about 1.5 GB and there are 240 files, so roughly 360 GB in total, far more than fits in memory. I want to combine them into a single Zarr array in a Google Cloud Storage bucket.

My first attempt was to initialize a zarr array that is empty along the first dimension, as follows:

import gcsfs
import numpy as np
import zarr

z = zarr.open(
    gcsfs.GCSFileSystem(project=<project-name>).get_mapper(<bucket-name>),
    mode='w',
    shape=(0,256,1440,3),
    dtype=np.float32
)

then read each 1.5 GB file and append it to the array

for fn in filenames:
    z.append(np.load(fn))

This seems to work, but it is extremely slow; it looked like it might take multiple days.

I am doing this from a virtual machine on Google Cloud Platform, so my personal network speed shouldn't be an issue.

Is there an efficient, workable way to accomplish this task? Maybe using dask with an intermediate step? Any suggestions are appreciated.

qsfzy
  • that's definitely what I'd recommend - dask.array has [`dask.array.from_npy_stack`](https://docs.dask.org/en/stable/generated/dask.array.from_npy_stack.html) which should come in handy. the easiest way to do this might be to create a dask.array, then wrap this in an xarray Dataset and write to zarr (a sketch of this approach follows these comments). but you could do this using zarr directly as well. – Michael Delgado Sep 28 '22 at 03:48
  • I was looking into the `dask.array.from_npy_stack` but it doesn't work with stacks that you generate yourself. It apparently requires an extra configuration file that dask writes when you use `dask.array.to_npy_stack` and I think it also requires specific file names. – qsfzy Sep 28 '22 at 16:08
  • bummer! in that case you could use the distributed.client, preferably in threaded mode if you're working locally, to map jobs over tiles in parallel. This question might help: https://stackoverflow.com/a/72666271/3888719 - you'd have to adapt the example to work with multiple files instead of portions of a large file, but the map_blocks workflow should do the trick I think. – Michael Delgado Sep 28 '22 at 16:46
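
A rough sketch of the dask approach suggested above, assuming the same `filenames` list, float32 dtype, and `<project-name>`/`<bucket-name>` placeholders as in the question:

import dask
import dask.array as da
import gcsfs
import numpy as np

# Build one lazy dask array per file. Opening with mmap_mode only reads the
# .npy header, so getting each file's shape here is cheap.
lazy_arrays = []
for fn in filenames:
    shape = np.load(fn, mmap_mode='r').shape
    lazy_arrays.append(
        da.from_delayed(dask.delayed(np.load)(fn), shape=shape, dtype=np.float32)
    )

stacked = da.concatenate(lazy_arrays, axis=0)

# to_zarr requires regular chunks; if the files differ in length along axis 0,
# rechunk to a fixed size first (this is a no-op when they are all the same).
stacked = stacked.rechunk({0: lazy_arrays[0].shape[0]})

store = gcsfs.GCSFileSystem(project=<project-name>).get_mapper(<bucket-name>)
da.to_zarr(stacked, store, overwrite=True)

Wrapping `stacked` in an xarray Dataset and calling its `to_zarr` method would work just as well, as suggested in the comments.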

1 Answer


Assuming the npy files are all the same shape, the simplest thing that you can do is:

z = zarr.open(
    gcsfs.GCSFileSystem(project=<project-name>).get_mapper(<bucket-name>),
    mode='w',
    shape=(240,256,1440,3),
    dtype=np.float32
)

for i, fn in enumerate(filenames):
    z[i] = np.load(fn)

This does not use dask's parallelism, of course, but it does skip the checking and updating of the metadata and any last-chunk operations that `append()` has to do.

Note that you are using the default compression here (Blosc). With this much data, it would be worth your while finding out which compressor and options work best, and also considering whether lossy encoding would be acceptable to save bandwidth and storage costs.
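
For instance, a sketch of passing a non-default compressor when creating the array; the exact codec and settings here (Blosc with Zstandard) are just one candidate to benchmark, not a recommendation:

import gcsfs
import numpy as np
import zarr
from numcodecs import Blosc

# Zstandard through Blosc with byte-shuffle enabled; clevel trades write
# speed against compression ratio.
compressor = Blosc(cname='zstd', clevel=5, shuffle=Blosc.SHUFFLE)

z = zarr.open(
    gcsfs.GCSFileSystem(project=<project-name>).get_mapper(<bucket-name>),
    mode='w',
    shape=(240, 256, 1440, 3),
    dtype=np.float32,
    compressor=compressor
)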

mdurant
  • This is helpful. I was hoping to avoid pre-calculating the final size along the 0th dimension, but that's not such a hassle. – qsfzy Sep 28 '22 at 16:09
  • definitely preferable to pre-define your array rather than resize everything on each iteration. so if it's not a huge hassle I'd follow mdurant's advice here for sure. – Michael Delgado Sep 28 '22 at 16:49
  • You can always *over*size your array without problem. Any region without corresponding chunk files in the store is read as NaN (or whatever fill value you specify). – mdurant Sep 28 '22 at 17:03
  • Yes, and I learned from the question linked above (https://stackoverflow.com/a/72666271/3888719) that you can use `np.load(filename, mmap_mode='r').shape[0]` to quickly get a large file's dimensions and sum the sizes along an axis, which makes it pretty easy. – qsfzy Sep 28 '22 at 17:16
  • Also: `np.lib.format.read_magic(f); np.lib.format.read_array_header_1_0(f)`, which is more explicit and would work for any file-like object, e.g., a remote file opened with fsspec (see the sketch below). – mdurant Sep 28 '22 at 17:30
  • FWIW, properly sizing the array at the beginning instead of using `append` does seem to be helpful, noticeably faster, but I haven't bothered timing anything. – qsfzy Sep 28 '22 at 20:52
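
A sketch of the header-reading idea from these comments, assuming local .npy files written by np.save (format version 1.0) and the same `filenames` list as above:

import numpy as np

def npy_shape(fn):
    # Read only the .npy header, so no array data is loaded.
    with open(fn, 'rb') as f:
        np.lib.format.read_magic(f)
        shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
    return shape

# Total length along axis 0 across all files, to use as the first dimension
# of the zarr array instead of a hard-coded value.
total = sum(npy_shape(fn)[0] for fn in filenames)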