
I am working on a data analysis task using climate variables provided in netCDF files and want to use xarray to load and process the data. I have monthly files of hourly gridded data for each variable with a naming convention somewhat like:

{prefix}_{variablename}_YYYY_MM.nc

Each file is ~1.5 GB indexed by three coordinates: time (720 values for a 30-day month), latitude (721), longitude (1440). I extract the variable values at a particular grid cell (specified by lon/lat of a point of interest), calculate the average over the month, and finally want to calculate a long-term average (30 years for example) of monthly averages.

I tried a few things. With xarray.open_mfdataset() the grouped averaging just sat there for an hour. So I decided instead to extract the average of the variable from each file, stuff the results in a list, and convert the list to a dataframe to group and average again later.

This works better, but I'm having some trouble with the performance of the data extraction. The main loop, including the code to time each iteration, looks like this (the variable is snow depth, sd):

import time
import xarray as xr

output = []
for yr in range(startyr, endyr + 1):
    for mth in range(1, 13):
        start = time.time()
        # Build the path following the {prefix}_{variablename}_YYYY_MM.nc convention
        fp = datadirpath / f"{prefix}_sd_{yr}_{mth:02d}.nc"
        with xr.open_dataset(fp) as src:
            # Grid cell nearest to the point of interest
            one_month_snow = src['sd'].sel(
                latitude=lat, longitude=lon,
                method='nearest')
            # Monthly mean at that cell
            output.append([yr, mth, one_month_snow.mean().values.item()])
        stop = time.time()
        print(stop - start)

Each iteration takes about 22 seconds. That seems long. But very weirdly, if I re-run the exact same code, each iteration only takes <10 ms.

What makes no difference:

  • Whether I run this in a Jupyter Notebook or from a stand-alone script
  • Whether I restart the Python 3 kernel
  • Whether I use .sel(..., method='nearest') or provide the exact lon/lat coordinates

I'm a bit stumped here. Is something being cached at the OS level? How can I optimize this code, since I'm going to be doing this exact thing a lot? Should I use chunking?
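
For completeness, the "group and average again later" step I mentioned above is roughly this (just a sketch; the column names are my own choice):

import pandas as pd

# output holds the [year, month, monthly_mean] rows built by the loop above
df = pd.DataFrame(output, columns=['year', 'month', 'sd_mean'])

# Long-term average of the monthly means, one value per calendar month
# (df['sd_mean'].mean() would give a single overall figure instead)
longterm = df.groupby('month')['sd_mean'].mean()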

chryss
  • where is your data and how is your dask cluster set up? if you're doing this over network-attached storage it's possible there's some caching happening, even across python sessions. diagnosing networked read performance issues without getting your hands dirty with the actual data/connections is super tough and setup-specific, so we might not be able to help here :/ – Michael Delgado Oct 31 '21 at 18:34
  • I'd definitely read the data in chunks and use open_mfdataset, rather than building a list, though. One thing I'd try is the `parallel=True` option of [`xr.open_mfdataset`](http://xarray.pydata.org/en/stable/generated/xarray.open_mfdataset.html), which will spread the metadata burden to the dask workers. – Michael Delgado Oct 31 '21 at 18:34
  • The underlying netCDF library does implement a cache, but I am a bit surprised it would be this effective across 30 years of data. If you are doing this a lot, yes, you might be better off applying chunking across the time dimension. The difference in performance can be vast. See the documentation for `nccopy`, and also this article: https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters – Robert Davy Oct 31 '21 at 22:55
  • Thank you. I have not looked into any particular dask setup. The data is on networked storage, but the script is stored on the same file system. I will try what you say. What is a reasonable chunk size across the time dimension? – chryss Nov 01 '21 at 04:15
  • Hard to say whether this will help, but I had similar issues and managed to load my dataset much faster using open_mfdataset() with `concat_dim="time", combine="nested", data_vars='minimal', coords='minimal', compat='override'`. You can find more advice on loading multiple files [here](https://xarray.pydata.org/en/stable/user-guide/io.html#reading-multi-file-datasets). – Philippe Miron Nov 02 '21 at 18:44
  • Just in case, can you add information to the question about the system it is running on? Single computer or compute cluster. Hardware, incl. relevant specifications. Operating system, incl. version / edition. Size of physical RAM. Process size / memory consumption while it is running. Whatever else may be deemed relevant. – Peter Mortensen Sep 07 '22 at 16:54
  • *output.append* could exhibit [Shlemiel the painter’s algorithm](https://www.joelonsoftware.com/2001/12/11/back-to-basics/) behaviour. That is, building up a long list by adding small bits at end, with a full copy of the list for every step will increase the runtime in a quadratic manner. I don't know about Python, but I have done this *inadvertently* in several languages... – Peter Mortensen Sep 07 '22 at 16:56
  • OK, it is [probably not Shlemiel's fault](https://stackoverflow.com/questions/33044883/why-is-the-time-complexity-of-pythons-list-append-method-o1). [*extend()* could be about 4 times faster than append()](https://stackoverflow.com/questions/252703/what-is-the-difference-between-pythons-list-methods-append-and-extend/28119966#28119966). – Peter Mortensen Sep 07 '22 at 17:21
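
Pulling together the open_mfdataset() suggestions from the comments above, a sketch of what that route could look like, reusing datadirpath, prefix, lat and lon from the loop in the question (the chunk sizes are only a guess):

import xarray as xr

# Open all monthly files for the variable lazily. parallel=True lets dask
# read the per-file metadata in parallel, and the combine/compat options
# skip expensive coordinate comparisons across files.
ds = xr.open_mfdataset(
    str(datadirpath / f"{prefix}_sd_*.nc"),
    parallel=True,
    concat_dim="time",
    combine="nested",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    chunks={"time": 720, "latitude": 50, "longitude": 50},  # guessed chunk sizes
)

# Nearest grid cell, monthly means, then the long-term mean per calendar month
point = ds["sd"].sel(latitude=lat, longitude=lon, method="nearest")
monthly = point.resample(time="1MS").mean()
longterm = monthly.groupby("time.month").mean().compute()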

0 Answers