First Question on Stack!
I am working on a problem with very large datasets. I am trying to use rioxarray (chunked, with Dask) to pull together a large number of GeoTIFFs (not COGs, to my knowledge), concatenate them, and save the result as a netCDF. I'm running into problems when I try to write to file. I am doing this analysis on a local machine, so I don't have many cores to work with, but I was under the impression that there were settings that would let Dask queue up jobs so that they fit in memory, as long as the data fits on disk (even if it doesn't fit in RAM). Each raster is about 2,060 KB on disk, and they are stored at a daily timestep for 28 years, so that's (2,060 * 365) * 28 KB!
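For scale, my own back-of-envelope arithmetic puts the whole stack at roughly 20 GiB of GeoTIFFs on disk:

    # rough on-disk size of the full stack, assuming ~2,060 KB per daily GeoTIFF
    per_file_kb = 2060
    n_files = 365 * 28                          # one raster per day for 28 years
    total_kb = per_file_kb * n_files
    print(f"~{total_kb / 1024 ** 2:.1f} GiB")   # ~20.1 GiB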
When I reduce the number of files I'm working with, the following code succeeds. I got this general pattern from the post:
convert-raster-time-series-of-multiple-geotiff-images-to-netcdf
from glob import glob

import pandas as pd
import xarray as xr
import rioxarray as rioxa
from dask.diagnostics import ProgressBar


def read_rasters(files, test=False):
    """Open each GeoTIFF lazily (chunked with Dask) and concatenate along a new time dimension."""
    # sort the paths and create dates from them
    if test:
        # only do the last 1000 files
        paths = sorted(files)[-1000:]
    else:
        paths = sorted(files)
    dates = [get_date_from_file(p) for p in paths]  # helper (defined elsewhere) that parses the date from the filename
    # print('the dates', dates)
    time = xr.Variable('time', pd.DatetimeIndex(dates))
    print('determining chunks')
    chunks = get_chunksize(path=paths[0], cog=False)  # helper (defined elsewhere) that works out a chunk size from the first raster
    print('opening')
    datasets = [rioxa.open_rasterio(f, chunks=chunks) for f in paths]
    return xr.concat(datasets, dim=time)
path_to_gridmet = r'path\to\Gridmet\Daily\ETo'
gridmet_glob = r'path\to\Gridmet\Daily\ETo\**\eto*.tif'
gridmet_files = glob(gridmet_glob, recursive=True)

gridmet_dset = read_rasters(gridmet_files, test=True)
print('done reading in ETo')
print(gridmet_dset)

print('writing to file')
# build the write lazily, then compute it with a progress bar
delayed_obj = gridmet_dset.to_netcdf(path=r'Z:\Data\ReferenceET\USA\Gridmet\Daily\gm_eto.nc', compute=False)
with ProgressBar():
    results = delayed_obj.compute()
Am I missing some setting that would keep the process from breaking? I'm guessing I need to be more specific about how I direct Dask to approach the problem...
'done reading in ETo' always prints, so reading in the array is not the problem, but when I try to write to netCDF I get this:
determining chunks
path
C:\Users\gparrish\Documents\Drought\gridmet\eto1998321.tif
xras dimensions ('band', 'y', 'x')
4120
1629
opening
done reading in ETo
<xarray.DataArray (time: 6998, band: 1, y: 1629, x: 4120)>
dask.array<concatenate, shape=(6998, 1, 1629, 4120), dtype=float32, chunksize=(1, 1, 1629, 4120), chunktype=numpy.ndarray>
Coordinates:
  * band         (band) int32 1
  * x            (x) float64 -103.0 -103.0 -103.0 ... -94.44 -94.43 -94.43
  * y            (y) float64 37.0 37.0 37.0 37.0 ... 33.62 33.62 33.62 33.62
    spatial_ref  int32 0
  * time         (time) datetime64[ns] 1998-11-17 1998-11-18 ... 2018-01-01
Attributes:
    scale_factor:  1.0
    add_offset:    0.0
writing to file
Traceback (most recent call last):
  File "C:\Users\gparrish\PycharmProjects\xarray_sandbox\preprocessing\xarr_make_NCDF_ds.py", line 136, in <module>
    delayed_obj = gridmet_dset.to_netcdf(path=r'C:\Users\gparrish\Documents\Drought\gm_eto.nc', compute=False)
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\core\dataarray.py", line 2778, in to_netcdf
    return dataset.to_netcdf(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\core\dataset.py", line 1799, in to_netcdf
    return to_netcdf(
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\backends\api.py", line 1076, in to_netcdf
    dump_to_store(
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\backends\api.py", line 1123, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\backends\common.py", line 266, in store
    self.set_variables(
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\backends\common.py", line 304, in set_variables
    target, source = self.prepare_variable(
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\backends\scipy_.py", line 223, in prepare_variable
    self.ds.createVariable(name, data.dtype, variable.dims)
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\scipy\io\netcdf.py", line 388, in createVariable
    data = empty(shape_, dtype=type.newbyteorder("B"))  # convert to big endian always for NetCDF 3
numpy.core._exceptions.MemoryError: Unable to allocate 175. GiB for an array with shape (6998, 1, 1629, 4120) and data type >f4
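For what it's worth, the 175 GiB in the error does look like the size of the full uncompressed float32 array that scipy's netCDF writer is trying to allocate in one go (my own arithmetic, going by the shape in the traceback):

    # size of a float32 array with shape (6998, 1, 1629, 4120)
    n_time, n_band, n_y, n_x = 6998, 1, 1629, 4120
    size_bytes = n_time * n_band * n_y * n_x * 4    # float32 = 4 bytes per value
    print(f"{size_bytes / 1024 ** 3:.0f} GiB")      # ~175 GiB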
Is my chunksize too large? Am I supposed to instantiate the dask.distributed.Client()?
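In case it matters, by "instantiate the Client" I mean something like the sketch below; the worker counts and memory limit are just guesses on my part, not something I've gotten working:

    from dask.distributed import Client

    # hypothetical local cluster settings -- I don't know what values are sensible here
    client = Client(n_workers=2, threads_per_worker=2, memory_limit='4GB')
    print(client)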