
The code takes about 20 minutes to load one month of data per variable, with 168 time steps covering the 00 and 12 UTC cycles of each day. Saving to CSV is even slower: the code has been running for almost a day and still hasn't written a single station. How can I improve the code below?

[image: screenshot of the original code]

  • Please read the xarray guide to "More Advanced Indexing" within the indexing and selecting data section. You should never ever be looping over elements of the array and manually constructing a new array one element at a time. See this post for an example: https://stackoverflow.com/a/69337183/3888719 – Michael Delgado Feb 01 '23 at 19:10
  • In terms of getting help with this post - sorry, but no way can we wade through this much code. You need to do your own work of finding what part of your code is causing the problem and asking a narrowly-defined question with an accompanying *minimal* [mre]. See also [crafting a minimal bug report](//matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) for perspective and tips. Good luck! – Michael Delgado Feb 01 '23 at 19:13
  • Thank you Michael Delgado! I'll take a look at the links you mentioned. The code is actually complete and works the way I put it above. Adapted to netCDF files the process runs quickly, but with grib2 it takes too long and I don't understand why. But I'll take a look at your tips. @MichaelDelgado – William Jacondino Feb 01 '23 at 19:31
  • I saw your indexing example and it really improved the speed of opening the file a lot. Do you recommend modifying something in the code for saving the CSV as well? @MichaelDelgado – William Jacondino Feb 01 '23 at 20:21
  • The issue here isn’t that the example isn’t reproducible/complete (although it isn’t - we don’t have your files) but that it’s not anywhere close to minimal :) and really I’d recommend strongly against saving to csv at all if you can help it - csv is a text format which is extremely inefficient for both reading and writing. If you have to save it in a tabular format I’d recommend something binary like parquet. But why not netcdf or even better, zarr? Those will allow you to keep the dimensionality of the data and avoid a reshape. – Michael Delgado Feb 01 '23 at 22:18
  • Note also though that since you’re using dask, you’re probably only seeing the scheduling time until you convert the data to a dataframe. This will force the read, and it’s only at this stage that you’ll see the full read time. I’d recommend profiling your code with only one dataset at a time, to make that workflow as efficient as possible, then come back with a more narrow question if you still need help. What is true for one dataset will also be true for 12 or however many you have in this question. – Michael Delgado Feb 01 '23 at 22:23
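Building on the comments above, here is a minimal sketch of what vectorized ("advanced") indexing plus a binary output format could look like. The file pattern, the coordinate names (`latitude`/`longitude`), and the station locations are all hypothetical placeholders; adjust them to your data:

```python
import xarray as xr

# Hypothetical file pattern and engine; adjust to your data layout.
ds = xr.open_mfdataset("era5_*.grib2", engine="cfgrib")

# Hypothetical station coordinates (lat, lon).
stations = {
    "station_a": (-23.5, -46.6),
    "station_b": (-22.9, -43.2),
}
names = list(stations)
lats = xr.DataArray([stations[n][0] for n in names],
                    dims="station", coords={"station": names})
lons = xr.DataArray([stations[n][1] for n in names],
                    dims="station", coords={"station": names})

# Vectorized indexing: one .sel call extracts every station at once,
# instead of looping over points one element at a time.
points = ds.sel(latitude=lats, longitude=lons, method="nearest")

# Converting to a dataframe forces the dask read; write a binary format
# such as Parquet instead of CSV, or keep the dimensionality with
# points.to_zarr("stations.zarr").
points.to_dataframe().to_parquet("stations.parquet")
```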

1 Answer


Reading .grib files using xr.open_mfdataset() and cfgrib:

I can speak to the slowness of reading grib files using xr.open_mfdataset(). I had a similar task where I was reading in many grib files using xarray, and it was taking forever. Other people have experienced similar issues with this as well (see here).

According to the issue raised here, "cfgrib is not optimized to handle files with a huge number of fields even if they are small."

One thing that worked for me was converting as many of the individual grib files as I could into one (or several) netCDF files, and then reading the newly created netCDF file(s) into xarray instead. Here is a link showing several different methods for doing this. I went with the grib_to_netcdf command from the ecCodes tool.
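As a sketch of that conversion step, assuming ecCodes is installed and a hypothetical one-file-per-month layout under data/:

```python
import glob
import subprocess

import xarray as xr

# Convert each grib2 file to netCDF. grib_to_netcdf ships with ecCodes;
# -o names the output file.
for path in sorted(glob.glob("data/*.grib2")):
    subprocess.run(
        ["grib_to_netcdf", "-o", path.replace(".grib2", ".nc"), path],
        check=True,
    )

# Reading the converted netCDF files is typically much faster than cfgrib.
ds = xr.open_mfdataset("data/*.nc", combine="by_coords", parallel=True)
```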

In summary, I would start by converting your grib files to netCDF, since xarray can read that data in a much more performant manner. Then you can focus on other optimizations further down in your code.

I hope this helps!

Jeff Coldplume