How to specify a dtype of 'S20' in parallel processing with xr.apply_ufunc() in Python?

Question

I opened a large dask-backed xarray dataset with the dimensions (time: 20, y: 50000, x: 100000). The variable 'var' that I want to use contains uint8 values. For each timestep I want to convert the uint8 values to letters via the dictionary d, and for each location (y, x) I want to concatenate the 20 letters into a single string of dtype 'S20'.

I am using xr.apply_ufunc() to run the code in parallel. Although I set output_dtypes to ['S20'] in xr.apply_ufunc(), the output has a dtype of 'S1'. I tested this on a smaller subset of the array (size: 500, 1000) and noticed that if I load the variable into memory first and do not specify the dtype in xr.apply_ufunc(), the output has the required dtype of 'S20'. My entire dataset is too large to load into memory first.

My question is: How do I specify the output dtype correctly without loading the data into memory beforehand?

This is my code:

import xarray as xr
from dask.diagnostics import ProgressBar
from dask.distributed import Client
client = Client()

fp = 'myfile.nc'
ds = xr.open_dataset(fp, chunks={"y": 500, "x": 1000})
ds.close()
var = ds['var']

d = {0: 'A', 1: 'B', 2: 'C'}

def ttrans(tarray):
    for t in range(0,20):
        vt = d[tarray[t]] 
        if t == 0:
            temp = vt
        else:
            temp = temp + vt
    return temp

def pwrap(ds, dim=['time'], dask='parallelized'):
    with ProgressBar():
        res = xr.apply_ufunc(
            ttrans,
            ds,
            input_core_dims=[dim],
            vectorize=True,
            dataset_fill_value='N',
            dask=dask,
            output_dtypes=['S20'],
        ).compute()
    return res

result = pwrap(ds = var, dim = ['time'], dask='parallelized')

How about `xr.apply_ufunc(d.get, var, vectorize=True, dask="parallelized").str.join("time")`. There may also be faster ways to translate the numbers to strings, see: https://stackoverflow.com/q/16992713/3010700 — mathause, Feb 01 '22 at 17:00
No that does not work for dask arrays. So how about `v = xr.apply_ufunc(d.get, var, vectorize=True, dask="parallelized")` and then `xr.apply_ufunc(np.vectorize("".join, signature="(m)->()"), v, dask="parallelized", input_core_dims=(["time"],),).compute()` — mathause, Feb 01 '22 at 17:41
numpy uses `np.dtype("U2").char` as dtype (https://github.com/numpy/numpy/blob/c30876f6411ef0c5365a8e4cf40cc3d4ba41196c/numpy/lib/function_base.py#L2261) so the `"U20"` becomes `"U"` — mathause, Feb 01 '22 at 17:44
@mathause thank you for your reply. I am not sure if I understand your second comment. Would the variable "v" serve as a replacement for my "ttrans" function? Or would "v" be part of the two-step calculation of my "res" variable. In the latter case, I understand that d.get would correspond to ttrans.get but the .get does not work for functions (I get the error message: 'function' object has no attribute 'get') — chamalu, Feb 05 '22 at 18:08
It would be a two step calculation and the `d.get` is your dict and your `ttrans` function would no longer be needed. — mathause, Feb 07 '22 at 10:31
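Following the comments above, one way to sidestep the `np.vectorize` dtype truncation is to avoid `vectorize=True` and translate a whole block at once. The sketch below is NumPy-only and is not the asker's or commenter's code: `ttrans_block`, the 256-entry lookup table `lut`, and the `b"?"` placeholder for unmapped values are all hypothetical names and choices introduced for illustration. It assumes the time axis is the last axis of the block, which is where `input_core_dims=[["time"]]` places it.

```python
import numpy as np

# Mapping from the question.
d = {0: "A", 1: "B", 2: "C"}

# Lookup table covering every possible uint8 value.
# Assumption: unmapped values become b"?" (the original code would raise KeyError).
lut = np.full(256, b"?", dtype="S1")
for code, letter in d.items():
    lut[code] = letter.encode("ascii")


def ttrans_block(block):
    """Join single-byte letters along the last axis into one byte string.

    `block` is an integer array of shape (..., n_time); the result has
    shape (...) and dtype 'S<n_time>' (for example 'S20' for 20 timesteps).
    """
    letters = np.ascontiguousarray(lut[block])   # shape (..., n_time), dtype S1
    n_time = letters.shape[-1]
    joined = letters.view(f"S{n_time}")          # reinterpret bytes: shape (..., 1)
    return joined[..., 0]


# Small usage example with three timesteps instead of twenty.
block = np.array([[0, 1, 2], [2, 2, 0]], dtype=np.uint8)
out = ttrans_block(block)
print(out.dtype, out.tolist())
```

Because the function already handles arbitrary leading dimensions, it could probably be passed to `xr.apply_ufunc` with `input_core_dims=[["time"]]`, `dask="parallelized"` and `output_dtypes=["S20"]` but without `vectorize=True`, so `np.vectorize` never sees the dtype. This assumes the time dimension sits in a single chunk and has not been tested on the full dataset.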
