Questions tagged [zarr]

Zarr is a Python package providing an implementation of compressed, chunked, N-dimensional arrays (like NetCDF4, HDF5), designed for use in parallel computing and on the Cloud. See http://zarr.readthedocs.io/en/stable/ for more information.

93 questions
13
votes
1 answer

How best to rechunk a NetCDF file collection to Zarr dataset

I'm trying to rechunk a NetCDF file collection and create a Zarr dataset on AWS S3. I have 168 original NetCDF4 classic files with arrays of dimension time: 1, y: 3840, x: 4608 chunked as chunks={'time':1, 'y':768, 'x':922}. I want to write this…
Rich Signell
  • 14,842
  • 4
  • 49
  • 77
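One possible approach, as an untested sketch (the output bucket, file glob, and target chunk sizes below are assumptions, not details from the question): open the files lazily with xarray and dask, rechunk, and stream the result to a Zarr store on S3.

```python
import fsspec
import xarray as xr

# Open the NetCDF collection lazily, using the source chunking from the question.
ds = xr.open_mfdataset("netcdf/*.nc", combine="by_coords",
                       chunks={"time": 1, "y": 768, "x": 922})

# Rechunk to the desired target layout (example values) and drop the old chunk
# encoding so to_zarr does not complain about conflicting chunk sizes.
ds = ds.chunk({"time": 24, "y": 768, "x": 922})
for var in ds.variables:
    ds[var].encoding.pop("chunks", None)

store = fsspec.get_mapper("s3://my-bucket/rechunked.zarr")  # hypothetical bucket
ds.to_zarr(store, mode="w", consolidated=True)
```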
11
votes
0 answers

Parallel appending to a zarr store via xarray.to_zarr and Dask

I am in a situation where I want to load objects, transform them into an xarray.Dataset and write that into a zarr store on s3. However, to make the loading of objects faster, I do it in parallel across distinct years using a Dask cluster, and thus,…
Maxime
  • 131
  • 3
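An untested sketch of one way to structure the writes: build each year's dataset in parallel on the cluster, but serialize the appends themselves, since concurrent appends to the same store are not safe (the store path and dataset list are hypothetical).

```python
import fsspec
import xarray as xr

store = fsspec.get_mapper("s3://my-bucket/objects.zarr")  # hypothetical store

# yearly_datasets: hypothetical list of xr.Dataset objects, one per year,
# each backed by dask and built in parallel on the cluster.
for i, year_ds in enumerate(yearly_datasets):
    if i == 0:
        year_ds.to_zarr(store, mode="w")                     # create the store
    else:
        year_ds.to_zarr(store, mode="a", append_dim="time")  # append along time
```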
7
votes
0 answers

How to cut down/delete part of a zarr array

I have a simple array (say length 1000) of objects in zarr. I want to replace it with a slimmed down version, picking only a subset of the items, as specified using a boolean array of size 1000. I want to keep everything else the same (e.g. if this…
user2667066
  • 1,867
  • 2
  • 19
  • 30
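A minimal, untested sketch of one workaround: a zarr array cannot be shrunk in place by a boolean mask, so read the kept items (an array of length 1000 fits in memory) and write them to a fresh array with the same chunking and compressor.

```python
import numpy as np
import zarr

z = zarr.open("data.zarr", mode="r")        # the existing array (hypothetical path)
keep = np.asarray(keep_mask, dtype=bool)    # keep_mask: the boolean selection, length 1000

subset = z[:][keep]                         # load, then filter in memory
out = zarr.open("data_trimmed.zarr", mode="w",
                shape=subset.shape, chunks=z.chunks, dtype=z.dtype,
                compressor=z.compressor, fill_value=z.fill_value)
out[:] = subset
```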
5
votes
0 answers

Storing Dask Array using Zarr Consumes Too Much Memory

I have a long list of .zarr arrays, that I would like to merge into a single array and write to disk. My code approximately looks as follows: import dask.array import zarr import os local_paths = ['parts/X_00000000.zarr', 'parts/X_00000001.zarr', …
r0f1
  • 2,717
  • 3
  • 26
  • 39
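An untested sketch of the usual lazy pipeline: wrap each part with da.from_zarr, concatenate without rechunking (rechunking before the write is a common cause of high memory use), and stream the result to a new store.

```python
import dask.array as da

# local_paths: the list of part files from the question
parts = [da.from_zarr(path) for path in local_paths]
merged = da.concatenate(parts, axis=0)           # keep the parts' native chunking
merged.to_zarr("X_merged.zarr", overwrite=True)  # hypothetical output path
```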
5
votes
1 answer

Creating a generator over a zarr array with start and end for pytorch dataloader

I'm working on a pytorch project where my data is saved in zarr. Random access on zarr is costly, but thanks to zarr using a blockwise cache, iteration is really quick. To harness this fact, I use an IterableDataset together with multiple…
sobek
  • 1,386
  • 10
  • 28
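An untested sketch of one way to wire this up: an IterableDataset that takes a [start, end) range and, inside __iter__, splits that range across DataLoader workers so each worker streams through its own contiguous span of the zarr array (the path and sizes are hypothetical).

```python
import math
import torch
import zarr
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ZarrIterable(IterableDataset):
    def __init__(self, path, start, end):
        self.path, self.start, self.end = path, start, end

    def __iter__(self):
        z = zarr.open(self.path, mode="r")
        start, end = self.start, self.end
        info = get_worker_info()
        if info is not None:
            # Give each worker its own contiguous, non-overlapping sub-range.
            per_worker = math.ceil((end - start) / info.num_workers)
            start = self.start + info.id * per_worker
            end = min(start + per_worker, self.end)
        for i in range(start, end):
            yield torch.as_tensor(z[i])

loader = DataLoader(ZarrIterable("data.zarr", 0, 10_000), batch_size=32, num_workers=4)
```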
5
votes
2 answers

Adding new Xarray DataArray to an existing Zarr store without re-writing the whole dataset?

How do I add a new DataArray to an existing Dataset without overwriting the whole thing? The new DataArray shares some coordinates with the existing one, but also has new ones. In my current implementation, the Dataset gets completely overwritten,…
jkmacc
  • 6,125
  • 3
  • 30
  • 27
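An untested sketch, assuming the new variable can simply be written as an additional variable: to_zarr with mode="a" adds variables to an existing store without rewriting the ones already there (the names and path below are hypothetical).

```python
import xarray as xr

# new_dataarray: hypothetical DataArray sharing some coordinates with the store
ds_new = new_dataarray.to_dataset(name="new_variable")
ds_new.to_zarr("existing_store.zarr", mode="a")  # adds new_variable alongside existing data
```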
4
votes
1 answer

How to avoid reading half-written arrays spanning multiple chunks using zarr?

In a multiprocess situation, I want to avoid reading arrays from a zarr group that haven't fully finished writing by the other process yet. This functionality does not seem to come out of the box with zarr. While chunk writing is atomic in zarr,…
bluppfisk
  • 2,538
  • 3
  • 27
  • 56
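An untested sketch of one possible convention (not a built-in zarr feature): the writer flips a "complete" attribute only after all chunks are written, and readers re-open the array and wait for that flag before reading. It assumes the array has already been created by the time the reader starts polling.

```python
import time
import zarr

# --- writer process ---
z = zarr.open("shared.zarr", mode="w", shape=(1000, 1000), chunks=(100, 100), dtype="f8")
z.attrs["complete"] = False
z[:] = compute_data()          # compute_data: hypothetical function producing the array
z.attrs["complete"] = True     # set the flag only once every chunk is on disk

# --- reader process ---
while True:
    z = zarr.open("shared.zarr", mode="r")   # re-open to avoid a stale attribute cache
    if z.attrs.get("complete", False):
        break
    time.sleep(0.1)
data = z[:]
```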
4
votes
1 answer

How to efficiently convert npy to xarray / zarr

I have a 37 GB .npy file that I would like to convert to Zarr store so that I can include coordinate labels. I have code that does this in theory, but I keep running out of memory. I want to use Dask in-between to facilitate doing this in chunks,…
thomaskeefe
  • 1,900
  • 18
  • 19
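An untested sketch of one common pattern: memory-map the .npy file so nothing is loaded eagerly, wrap it in a chunked dask array, attach coordinate labels with xarray, and stream it to zarr (the path, dimension names, and chunk sizes are assumptions).

```python
import dask.array as da
import numpy as np
import xarray as xr

mm = np.load("big_array.npy", mmap_mode="r")           # ~37 GB file, never fully in memory
darr = da.from_array(mm, chunks=(1000, -1))            # example chunking along the first axis
xarr = xr.DataArray(darr, dims=("sample", "feature"),
                    coords={"sample": sample_labels})  # sample_labels: hypothetical coordinates
xarr.to_dataset(name="data").to_zarr("big_array.zarr", mode="w")
```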
4
votes
1 answer

ValueError: unrecognized engine zarr must be one of: ['scipy', 'store']

I am trying to open a zarr file as: import pandas as pd import xarray as xr xf = xr.open_zarr("../../data/processed/geolink_norge_dataset/geolink_norge_well_logs.zarr") But this raises the error: ValueError …
K Code
  • 41
  • 1
  • 2
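One common cause, offered as an assumption rather than a confirmed diagnosis: the "zarr" engine is only registered when the zarr package is importable in the same environment as xarray (and xarray is new enough to know about it), so checking both is a reasonable first step.

```python
import xarray as xr
import zarr   # an ImportError here would explain why the engine is missing

print(xr.__version__, zarr.__version__)
xf = xr.open_zarr("../../data/processed/geolink_norge_dataset/geolink_norge_well_logs.zarr")
```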
4
votes
0 answers

How to store a set of images and labels dynamically in zarr format?

I have read the zarr documentation, but could not work out how to do this. I want something like this tree format: ---dataset ---sample1 --frames (10 RGB frames) --labels (1/0) ---sample2 --frames (10 RGB frames) …
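An untested sketch of one possible layout: a zarr group per sample, an array holding the frames, and the label stored as an attribute (shapes and names are hypothetical).

```python
import numpy as np
import zarr

root = zarr.open_group("dataset.zarr", mode="a")   # append new samples over time

def add_sample(name, frames, label):
    # frames: array of shape (10, H, W, 3); label: 0 or 1
    g = root.require_group(name)
    g.create_dataset("frames", data=frames, chunks=(1,) + frames.shape[1:], overwrite=True)
    g.attrs["label"] = int(label)

add_sample("sample1", np.zeros((10, 64, 64, 3), dtype="uint8"), 1)
add_sample("sample2", np.zeros((10, 64, 64, 3), dtype="uint8"), 0)
```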
4
votes
1 answer

I'm getting a TypeError when converting .h5 (HDF5) file into .zarr format

I'm trying to convert a .h5 file to .zarr format, but I'm getting the following error: TypeError: Object of type bytes_ is not JSON serializable. I'm posting my code below: import h5py import zarr from sys import stdout source = h5py.File('file.h5',…
Pratik Gorade
  • 41
  • 1
  • 2
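An untested sketch of one workaround, assuming the failure comes from HDF5 attributes whose values are numpy bytes scalars (which zarr cannot serialize to JSON): decode those attributes while copying. The loop also assumes a flat file of datasets.

```python
import h5py
import numpy as np
import zarr

def to_jsonable(value):
    # Decode bytes-like attribute values so zarr can store them as JSON.
    if isinstance(value, (bytes, np.bytes_)):
        return value.decode()
    return value

source = h5py.File("file.h5", "r")
dest = zarr.open_group("file.zarr", mode="w")

for name, dset in source.items():
    out = dest.create_dataset(name, data=dset[...], chunks=dset.chunks or True)
    for key, value in dset.attrs.items():
        out.attrs[key] = to_jsonable(value)
```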
3
votes
0 answers

Disable xarray's automatic use of dask within a dask task

Background I'm using dask to manage tens, sometimes hundreds of thousands of jobs, each of which involves reading in zarr data, transforming the data in some way, and writing out output (one output per job). I'm using a pangeo/daskhub-style…
Michael Delgado
  • 13,789
  • 3
  • 29
  • 54
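An untested sketch of one way to keep dask out of the per-job work: open the store with chunks=None so xarray returns plain numpy-backed variables instead of wrapping them in dask (the function and path names are hypothetical).

```python
import xarray as xr

def job(in_path, out_path):
    ds = xr.open_zarr(in_path, chunks=None)   # numpy-backed, no dask graph inside the task
    result = transform(ds)                    # transform: hypothetical per-job computation
    result.to_zarr(out_path, mode="w")
```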
3
votes
2 answers

Getting a view of a zarr array slice

I would like to produce a zarr array pointing to part of a zarr array on disk, similar to how sliced = np_arr[5] gives me a view into np_arr, such that modifying the data in sliced modifies the data in np_arr. Example code: import matplotlib.pyplot…
Juan
  • 5,433
  • 21
  • 23
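Zarr does not hand out live views the way numpy slicing does; an untested sketch of one workaround is a small wrapper that forwards reads and writes for a fixed index back to the parent array, so modifications land in the on-disk store (the class is hypothetical, not part of zarr).

```python
import zarr

class ZarrSliceView:
    def __init__(self, parent, index):
        self.parent = parent
        self.index = index

    def __getitem__(self, key):
        return self.parent[self.index][key]

    def __setitem__(self, key, value):
        block = self.parent[self.index]   # read-modify-write the selected slice
        block[key] = value
        self.parent[self.index] = block

z = zarr.zeros((10, 5), chunks=(1, 5), store="example.zarr", overwrite=True)
view = ZarrSliceView(z, 5)
view[:] = 1.0
print(z[5])   # the write went through to the parent array on disk
```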
3
votes
1 answer

How to convert numpy array to a Zarr array

Suppose I have converted a simple two-column dataframe to a numpy array: gdf.head() >>> rid rast 0 1 01000001000761C3ECF420013F0761C3ECF42001BF7172... 1 2 01000001000761C3ECF420013F0761C3ECF42001BF64BF... 2 3 …
gwydion93
  • 1,681
  • 3
  • 28
  • 59
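For a plain numeric numpy array the conversion is a one-liner; an untested sketch is below (the question's column of hex strings would additionally need an object codec such as numcodecs.VLenUTF8, which is not shown here).

```python
import numpy as np
import zarr

arr = np.arange(1_000_000, dtype="float64")   # stand-in for the converted column
z = zarr.array(arr, chunks=(100_000,))        # chunked, compressed in-memory zarr array
zarr.save("array.zarr", arr)                  # or write straight to a directory store
print(z.shape, z.chunks)
```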
3
votes
2 answers

how to load and process zarr files using dask and xarray

I have monthly zarr files in s3 that have gridded temperature data. I would like to pull down multiple months of data for one lat/lon and create a dataframe of that time series. Some pseudo code: datasets=[] for file in files: s3 =…
David
  • 181
  • 13
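An untested sketch of one way to do this: open each monthly store lazily, concatenate along time, select a single grid point, and convert just that series to a dataframe (the bucket, variable, and coordinate names are assumptions).

```python
import fsspec
import xarray as xr

paths = ["s3://my-bucket/temps/2021-01.zarr",
         "s3://my-bucket/temps/2021-02.zarr"]          # hypothetical monthly stores
datasets = [xr.open_zarr(fsspec.get_mapper(p)) for p in paths]
ds = xr.concat(datasets, dim="time")

point = ds["temperature"].sel(lat=40.0, lon=-105.0, method="nearest")
df = point.to_dataframe()                              # time series at one grid point
```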