
Intro.
I am processing multiple images in parallel by mapping a set of functions over the images with dask's client.map; each function returns a pandas dataframe. I cannot predict the size of the output before the computation.
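
For context, here is a rough sketch of that current setup (process_image and image_paths are just placeholders standing in for my real code and data):

import pandas as pd
from dask.distributed import Client

def process_image(path):
    # load one image, analyse it and return a pandas dataframe whose
    # length is only known once the computation has run
    return pd.DataFrame({"image": [path]})

client = Client()  # local cluster
image_paths = ["img_0.zarr", "img_1.zarr"]  # placeholder paths
futures = client.map(process_image, image_paths)  # one future per image
results = client.gather(futures)  # list of dataframes of different sizes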

In order to make the code more readable and to avoid dragging the images' metadata around through the computation, I have been looking into xarray.

I create the dataset by loading .zarr files with open_mfdataset and parallel=True (a rough sketch of that call is included after the mock snippet below). Here is a snippet that creates a mock dataset similar to the one I am working with:

import xarray as xr
from dask import array as da
import numpy as np


def create_mock_dataset():
    # Create the data array with labelled coords
    data = da.random.random([2, 3, 2, 10, 10])
    mock_array = xr.DataArray(
        data=data,
        coords={
            "fov": np.arange(2),
            "round_num": np.arange(3),
            "z": np.arange(2),
            "r": np.arange(10),
            "c": np.arange(10),
        },
        dims=["fov", "round_num", "z", "r", "c"],
    )
    ds = xr.Dataset({"mock": mock_array})
    # One chunk per (fov, round_num) pair, i.e. one chunk per image
    chunks_dict = {"fov": 1, "round_num": 1, "z": 2, "r": 10, "c": 10}
    ds = ds.chunk(chunks_dict)
    return ds


test_dataset = create_mock_dataset()


def chunk_processing_func(xarray_chunk):
    # Processing of the chunk: the result is reduced to a different shape
    # or data structure, e.g. a pandas dataframe.
    mock_output = np.arange(200)  # could be a dataframe or another data structure
    return mock_output
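
For reference, the loading step mentioned above looks roughly like this (the glob pattern and the engine="zarr" argument are placeholders for my actual setup):

import xarray as xr

# parallel=True opens the individual stores in parallel using dask.delayed
ds = xr.open_mfdataset("data/*.zarr", engine="zarr", parallel=True)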


I would like to process each chunk (each corresponding to one image) in parallel and make use of the coords data in the xarray objects. I have been testing apply_ufunc (following the instructions from this really clear answer) and map_blocks, but if I understood correctly the shape of the output must be known in advance.
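
For example, this is roughly how one of my map_blocks tests looked on the mock dataset above (per_block is an illustrative name, and it only runs because it returns an object matching the template, which is exactly the constraint I am asking about):

def per_block(block):
    # `block` is an xr.Dataset holding a single chunk, with its coords
    # attached, so the fov / round_num labels are available here
    fov = block["mock"].coords["fov"].values
    # real processing would reduce the block to e.g. a dataframe, but
    # map_blocks requires the return value to match the template
    return block

result = test_dataset.map_blocks(per_block, template=test_dataset)
computed = result.compute()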

So what would be the best approach to process xarray datasets in parallel with functions that use the coords info and the data, but return a different type of output that doesn't need to be aligned with the dataset?

Thanks!
