First Question on Stack!
I am working on a problem with very large datasets. I am trying to use rioxarray (chunked, with Dask) to pull together a large number of GeoTIFFs (not COGs, to my knowledge), concatenate them, and save the result as a netCDF. I'm running into problems when I try to write to file. I am doing this analysis on a local machine, so I don't have many cores to work with, but I was under the impression that there were settings that would let Dask queue up jobs so that they fit in memory, as long as the data fits on disk (even if it doesn't fit in RAM). Each raster is about 2,060 KB on disk, and they are stored at a daily timestep for 28 years, so that's (2,060 * 365) * 28 KB!
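For scale, my own back-of-envelope arithmetic puts the whole stack at roughly 20 GiB of GeoTIFFs on disk:

    # rough on-disk size of the full stack, assuming ~2,060 KB per daily GeoTIFF
    per_file_kb = 2060
    n_files = 365 * 28                          # one raster per day for 28 years
    total_kb = per_file_kb * n_files
    print(f"~{total_kb / 1024 ** 2:.1f} GiB")   # ~20.1 GiB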
When I reduce the number of files I'm working with, the following code succeeds. I got this general pattern from the post:
convert-raster-time-series-of-multiple-geotiff-images-to-netcdf
from glob import glob

import pandas as pd
import xarray as xr
import rioxarray as rioxa
from dask.diagnostics import ProgressBar


def read_rasters(files, test=False):
    """Open each GeoTIFF lazily (chunked with Dask) and concatenate along a new time dimension."""
    # sort the paths and create dates from them
    if test:
        # only do the last 1000 files
        paths = sorted(files)[-1000:]
    else:
        paths = sorted(files)
    dates = [get_date_from_file(p) for p in paths]  # helper (defined elsewhere) that parses the date from the filename
    # print('the dates', dates)
    time = xr.Variable('time', pd.DatetimeIndex(dates))
    print('determining chunks')
    chunks = get_chunksize(path=paths[0], cog=False)  # helper (defined elsewhere) that works out a chunk size from the first raster
    print('opening')
    datasets = [rioxa.open_rasterio(f, chunks=chunks) for f in paths]
    return xr.concat(datasets, dim=time)
path_to_gridmet = r'path\to\Gridmet\Daily\ETo'
gridmet_glob = r'path\to\Gridmet\Daily\ETo\**\eto*.tif'
gridmet_files = glob(gridmet_glob, recursive=True)

gridmet_dset = read_rasters(gridmet_files, test=True)
print('done reading in ETo')
print(gridmet_dset)

print('writing to file')
# build the write lazily, then compute it with a progress bar
delayed_obj = gridmet_dset.to_netcdf(path=r'Z:\Data\ReferenceET\USA\Gridmet\Daily\gm_eto.nc', compute=False)
with ProgressBar():
    results = delayed_obj.compute()
Am I missing some setting that would keep the process from breaking? I'm guessing I need to be more specific about how I direct Dask to approach the problem...
'done reading in ETo' always prints, so reading in the array is not the problem, but when I try to write to netCDF I get this:
determining chunks
path
C:\Users\gparrish\Documents\Drought\gridmet\eto1998321.tif
xras dimensions ('band', 'y', 'x')
4120
1629
opening
done reading in ETo
<xarray.DataArray (time: 6998, band: 1, y: 1629, x: 4120)>
dask.array<concatenate, shape=(6998, 1, 1629, 4120), dtype=float32, chunksize=(1, 1, 1629, 4120), chunktype=numpy.ndarray>
Coordinates:
  * band         (band) int32 1
  * x            (x) float64 -103.0 -103.0 -103.0 ... -94.44 -94.43 -94.43
  * y            (y) float64 37.0 37.0 37.0 37.0 ... 33.62 33.62 33.62 33.62
    spatial_ref  int32 0
  * time         (time) datetime64[ns] 1998-11-17 1998-11-18 ... 2018-01-01
Attributes:
    scale_factor:  1.0
    add_offset:    0.0
writing to file
Traceback (most recent call last):
  File "C:\Users\gparrish\PycharmProjects\xarray_sandbox\preprocessing\xarr_make_NCDF_ds.py", line 136, in <module>
    delayed_obj = gridmet_dset.to_netcdf(path=r'C:\Users\gparrish\Documents\Drought\gm_eto.nc', compute=False)
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\core\dataarray.py", line 2778, in to_netcdf
    return dataset.to_netcdf(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\core\dataset.py", line 1799, in to_netcdf
    return to_netcdf(
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\backends\api.py", line 1076, in to_netcdf
    dump_to_store(
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\backends\api.py", line 1123, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\backends\common.py", line 266, in store
    self.set_variables(
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\backends\common.py", line 304, in set_variables
    target, source = self.prepare_variable(
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\xarray\backends\scipy_.py", line 223, in prepare_variable
    self.ds.createVariable(name, data.dtype, variable.dims)
  File "C:\ProgramData\Anaconda3\envs\xarr_env\lib\site-packages\scipy\io\netcdf.py", line 388, in createVariable
    data = empty(shape_, dtype=type.newbyteorder("B"))  # convert to big endian always for NetCDF 3
numpy.core._exceptions.MemoryError: Unable to allocate 175. GiB for an array with shape (6998, 1, 1629, 4120) and data type >f4
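For what it's worth, the 175 GiB in the error does look like the size of the full uncompressed float32 array that scipy's netCDF writer is trying to allocate in one go (my own arithmetic, going by the shape in the traceback):

    # size of a float32 array with shape (6998, 1, 1629, 4120)
    n_time, n_band, n_y, n_x = 6998, 1, 1629, 4120
    size_bytes = n_time * n_band * n_y * n_x * 4    # float32 = 4 bytes per value
    print(f"{size_bytes / 1024 ** 3:.0f} GiB")      # ~175 GiB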
Is my chunksize too large? Am I supposed to instantiate the dask.distributed.Client()?
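In case it matters, by "instantiate the Client" I mean something like the sketch below; the worker counts and memory limit are just guesses on my part, not something I've gotten working:

    from dask.distributed import Client

    # hypothetical local cluster settings -- I don't know what values are sensible here
    client = Client(n_workers=2, threads_per_worker=2, memory_limit='4GB')
    print(client)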