
I'm reading some (apparently) large grib files using xarray. I say 'apparently' because they're ~100MB each, which doesn't seem too big to me. However, running

import xarray as xr
ds = xr.open_dataset("gribfile.grib", engine="cfgrib")

takes a good 5-10 minutes. Worse, reading one of these files uses almost 4 GB of RAM - which surprises me given the lazy loading that xarray is supposed to do, not least because that's 40-odd times the size of the original file!

This reading time and RAM usage seem excessive and don't scale to the 24 files I have to read.

I've tried using dask and xr.open_mfdataset, but this doesn't seem to help when the individual files are so large. Any suggestions?
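
For reference, the open_mfdataset attempt looked roughly like this - the file pattern and chunk sizes are illustrative placeholders rather than my exact call:

import xarray as xr

# Open all the files lazily, backed by dask (chunk sizes here are a guess, not tuned)
ds = xr.open_mfdataset(
    "gribfiles/*.grib",
    engine="cfgrib",
    chunks={"time": 1},
)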

Addendum: the dataset looks like this once opened:

<xarray.Dataset>
Dimensions:     (latitude: 10, longitude: 10, number: 50, step: 53, time: 45)
Coordinates:
  * number      (number) int64 1 2 3 4 5 6 7 8 9 ... 42 43 44 45 46 47 48 49 50
  * time        (time) datetime64[ns] 2011-01-02 2011-01-04 ... 2011-03-31
  * step        (step) timedelta64[ns] 0 days 00:00:00 ... 7 days 00:00:00
    surface     int64 0
  * latitude    (latitude) float64 56.0 55.0 54.0 53.0 ... 50.0 49.0 48.0 47.0
  * longitude   (longitude) float64 6.0 7.0 8.0 9.0 10.0 ... 12.0 13.0 14.0 15.0
    valid_time  (time, step) datetime64[ns] 2011-01-02 ... 2011-04-07
Data variables:
    u100        (number, time, step, latitude, longitude) float32 6.389208 ... 1.9880934
    v100        (number, time, step, latitude, longitude) float32 -13.548858 ... -3.5112982
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    history:                 GRIB to CDM+CF via cfgrib-0.9.4.2/ecCodes-2.9.2 ...

I've temporarily got around the issue by reading in the grib files one by one and writing them back out to disk as netcdf. xarray then handles the netcdf files as expected. Obviously it would be nice not to have to do this because it takes ages - I've only done 4 files so far.
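
The conversion I'm doing is roughly the following (the file names are illustrative):

import glob
import xarray as xr

# Read each grib file with cfgrib and write it back out as netCDF
for path in glob.glob("gribfiles/*.grib"):
    ds = xr.open_dataset(path, engine="cfgrib")
    ds.to_netcdf(path.replace(".grib", ".nc"))
    ds.close()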

  • Can you share what your dataset (`ds`) looks like once you've opened it with xarray? – jhamman Dec 08 '18 at 22:58
  • sure - see edit. – jezza Dec 09 '18 at 14:33
  • have you tried converting them to netcdf with `grib_to_netcdf`? – Matteo De Felice Dec 09 '18 at 19:25
  • Just tried this. Not much faster, but memory usage is much lower which is handy. Thanks! – jezza Dec 10 '18 at 21:08
  • I'm a bit surprised as I regularly test with 1Gb GRIB files with no problem. Are the files downloaded from MARS, can you share the request you are using? – alexamici Dec 19 '18 at 11:34
  • I didn't download these files myself so I'm afraid I'm not overly familiar with the data retrieval side of things. I ended up using the ecCodes grib_to_netcdf command line tool and a bash script and left it running overnight. The problem with using python for the conversion was that I couldn't get python to reallocate/garbage collect the memory each time I was done with a file so I would quickly run out of RAM. I was hoping someone would say they have no problem with large files though. It would seem a massive drawback to the grib format if it was normally as slow as I was experiencing. – jezza Mar 26 '19 at 00:19
  • I'm having the same problem reading files from the ERA5 reanalysis. My files are about 270 MB each, downloaded directly from ECMWF's CDS. It's taking about 16 minutes to read a single file, with similar RAM usage. – Mateus da Silva Teixeira Oct 09 '20 at 18:25
  • Another good option could be to convert them to Zarr format and work with that instead; Zarr is more RAM- and disk-space-friendly. – dl.meteo Apr 25 '22 at 07:34
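
A rough Python equivalent of the grib_to_netcdf route mentioned in the comments above - driving the ecCodes command-line tool one file at a time, so each conversion's memory is released when its process exits (the paths are illustrative):

import glob
import subprocess

# Convert each grib file with ecCodes' grib_to_netcdf, one process per file
for path in glob.glob("gribfiles/*.grib"):
    out = path.replace(".grib", ".nc")
    subprocess.run(["grib_to_netcdf", "-o", out, path], check=True)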

0 Answers