I am processing multiple images in parallel by mapping a set of functions over them with dask's `client.map`, each returning a pandas DataFrame. I cannot predict the size of the output before the computation.
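For context, here is a minimal sketch of that workflow; `process_image` is a hypothetical stand-in for the real per-image functions, and the in-process `Client` is just for illustration:

```python
import numpy as np
import pandas as pd
from dask.distributed import Client

def process_image(img):
    # Hypothetical reduction: per-row means. The real functions return
    # DataFrames whose size is not known before the computation runs.
    return pd.DataFrame({"mean": img.mean(axis=1)})

client = Client(processes=False, n_workers=1, threads_per_worker=2)
images = [np.random.random((10, 10)) for _ in range(4)]
futures = client.map(process_image, images)
results = client.gather(futures)  # list of DataFrames, one per image
combined = pd.concat(results, ignore_index=True)
client.close()
```

The downside of this pattern is exactly what the question describes: each image's metadata has to be carried alongside the array by hand.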
To make the code more readable and avoid dragging the images' metadata through the computation, I have been looking into xarray. I create a dataset by loading `.zarr` files with `open_mfdataset` and `parallel=True`.
Here is a snippet that creates a mock dataset similar to the one I am working with:
```python
import xarray as xr
from dask import array as da
import numpy as np

def create_mock_dataset():
    # Dimensions: field of view, imaging round, z-plane, rows, columns
    data = da.random.random([2, 3, 2, 10, 10])
    mock_array = xr.DataArray(
        data=data,
        coords={
            "fov": np.arange(2),
            "round_num": np.arange(3),
            "z": np.arange(2),
            "r": np.arange(10),
            "c": np.arange(10),
        },
        dims=["fov", "round_num", "z", "r", "c"],
    )
    ds = xr.Dataset({"mock": mock_array})
    # One chunk per (fov, round_num) image stack
    chunks_dict = {"fov": 1, "round_num": 1, "z": 2, "r": 10, "c": 10}
    return ds.chunk(chunks_dict)

test_dataset = create_mock_dataset()
```
test_dataset = create_mock_dataset()
```python
def chunk_processing_func(xarray_chunk):
    # Processing reduces the chunk to a different shape, or to a
    # different data structure entirely (e.g. a pandas DataFrame)
    mock_output = np.arange(200)  # could be a DataFrame or another structure
    return mock_output
```
I would like to process each chunk (each corresponding to an image) in parallel, making use of the coords data in the xarray object.
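To make "use of the coords" concrete: selecting one image keeps its coordinate labels attached, which is the metadata I want available inside the processing function (the array below mirrors the mock dataset from the snippet above):

```python
import numpy as np
import xarray as xr

mock = xr.DataArray(
    np.random.random((2, 3, 2, 10, 10)),
    coords={
        "fov": np.arange(2),
        "round_num": np.arange(3),
        "z": np.arange(2),
        "r": np.arange(10),
        "c": np.arange(10),
    },
    dims=["fov", "round_num", "z", "r", "c"],
)

img = mock.isel(fov=1, round_num=2, z=0)
# img is a single 10x10 image; its scalar coords identify which image it is
fov_id = int(img.coords["fov"])          # 1
round_id = int(img.coords["round_num"])  # 2
```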
I have been testing `apply_ufunc` (following the instructions from this really clear answer) and `map_blocks`, but if I understood correctly, with both the size of the output must be known in advance.
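To illustrate the constraint: `xr.map_blocks` works when the function returns an xarray object whose shape it can infer (here made explicit with the `template` argument), but not when the function returns an arbitrary structure like a DataFrame:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(12.0).reshape(3, 4), dims=("x", "y")
).chunk({"x": 1})

# Works: the output is an xarray object whose shape matches the template
doubled = xr.map_blocks(lambda block: block * 2, arr, template=arr)
result = doubled.compute()
```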
So what would be the best approach to process xarray datasets in parallel with functions that use both the coords info and the data, but return a different type of output that doesn't need to be aligned with the dataset?
Thanks!