I opened a large dask-backed xarray Dataset with the dimensions (time: 20, y: 50000, x: 100000). The variable 'var' I want to use contains uint8 values. For each timestep I want to convert the uint8 value to a letter via the dictionary d, and for each location (y, x) I want to concatenate the 20 letters into one string of dtype 'S20'.
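For example, for a single (y, x) location the conversion should work like this (the values here are just made up for illustration):

import numpy as np

d = {0: 'A', 1: 'B', 2: 'C'}

# 20 made-up uint8 values along 'time' for a single (y, x) location
pixel = np.array([0, 1, 2, 0, 1] * 4, dtype=np.uint8)

# map every value to its letter and join them into one 20-character string
letters = ''.join(d[v] for v in pixel)
print(letters)                               # ABCABABCABABCABABCAB
print(np.array(letters, dtype='S20').dtype)  # |S20 -- the dtype I want per location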
I am using xr.apply_ufunc() to run the code in parallel. Although I set output_dtypes to ['S20'] in xr.apply_ufunc(), the output has a dtype of 'S1'. I have tested it with a smaller subset of the array (size: 500 x 1000) and noticed that if I load the variable first and don't specify the dtype in xr.apply_ufunc(), the output has the required dtype 'S20'. My entire dataset is too large to load into memory beforehand.
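The smaller test looked roughly like this (the exact slice is not important; ttrans is the same function as in the full code below):

# load a 500 x 1000 subset into memory first, then apply the function without
# dask and without output_dtypes -- here the result came back with dtype 'S20'
small = var.isel(y=slice(0, 500), x=slice(0, 1000)).load()
res_small = xr.apply_ufunc(ttrans, small,
                           input_core_dims=[['time']],
                           vectorize=True)
print(res_small.dtype)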
My question is: how do I specify output_dtypes correctly without loading the xarray into memory beforehand?
This is my code:
import xarray as xr
from dask.diagnostics import ProgressBar
from dask.distributed import Client
client = Client()
fp = 'myfile.nc'
ds = xr.open_dataset(fp, chunks={"y": 500, "x": 1000})
ds.close()
var = ds['var']
# map uint8 values to letters
d = {
    0: 'A',
    1: 'B',
    2: 'C'
}
def ttrans(tarray):
    # tarray holds the 20 uint8 values along 'time' for one (y, x) location;
    # convert each value to its letter and concatenate them into one string
    for t in range(0, 20):
        vt = d[tarray[t]]
        if t == 0:
            temp = vt
        else:
            temp = temp + vt
    return temp
def pwrap(ds, dim=['time'], dask='parallelized'):
    with ProgressBar():
        res = xr.apply_ufunc(ttrans,
                             ds,
                             input_core_dims=[dim],
                             vectorize=True,
                             dataset_fill_value='N',
                             dask=dask,
                             output_dtypes=['S20']
                             ).compute()
    return res
result = pwrap(ds=var, dim=['time'], dask='parallelized')
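When I check the result, the dtype is not what I specified:

print(result.dtype)   # dtype('S1') instead of the required dtype('S20')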