I am working on a data analysis task using climate variables provided in netCDF files and want to use xarray
to load and process the data. I have monthly files of hourly gridded data for each variable with a naming convention somewhat like:
{prefix}_{variablename}_YYYY_MM.nc
Each file is ~1.5 GB and indexed by three coordinates: time (720 values for a 30-day month), latitude (721 values), and longitude (1440 values). For each file, I extract the variable values at a particular grid cell (specified by the lon/lat of a point of interest) and calculate the average over the month; finally, I want to calculate a long-term average (30 years, for example) of these monthly averages.
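To make that concrete, the per-file step for a single month would look roughly like this (fp, variablename, lat, and lon are placeholders, not my actual values):

import xarray as xr

# rough per-file sketch; fp, variablename, lat, and lon are placeholders
with xr.open_dataset(fp) as ds:
    point = ds[variablename].sel(latitude=lat, longitude=lon, method="nearest")
    monthly_mean = float(point.mean("time"))   # average over the ~720 hourly values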
I tried a few things. With xarray.open_mfdataset()
the grouped averaging just sat there for an hour. So I decided instead to extract the average of the variable from each file, stuff the results into a list, and convert the list to a DataFrame to group and average later.
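The open_mfdataset attempt was something along these lines (simplified; the glob pattern and the exact grouping are approximate, and the same placeholder names as above are assumed):

# simplified sketch of the open_mfdataset attempt; pattern and grouping are approximate
import xarray as xr

ds = xr.open_mfdataset(str(datadirpath / f"{prefix}_{variablename}_*.nc"))
point = ds[variablename].sel(latitude=lat, longitude=lon, method="nearest")
monthly = point.resample(time="1MS").mean()                 # one average per month
longterm = monthly.groupby("time.month").mean().compute()   # long-term average of the monthly averages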
This works better, but I'm having trouble with the performance of the data extraction. The main loop, with timing code for each iteration, looks like this (the variable is snow depth, sd):
import time
import xarray as xr

output = []
for yr in range(startyr, endyr + 1):
    for mth in range(1, 13):
        start = time.time()
        fp = datadirpath / f"{prefix}_sd_{yr}_{str(mth).zfill(2)}.nc"
        with xr.open_dataset(fp) as src:
            # take the time series at the nearest grid cell and average it over the month
            one_month_snow = src['sd'].sel(
                latitude=lat, longitude=lon,
                method='nearest')
            output.append([yr, mth, one_month_snow.mean().values.item()])
        stop = time.time()
        print(stop - start)
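Afterwards, the plan is to turn output into a DataFrame and do the grouping and averaging there, roughly like this (column names are just placeholders):

# rough sketch of the downstream grouping step; column names are placeholders
import pandas as pd

df = pd.DataFrame(output, columns=["year", "month", "sd_mean"])
longterm_by_month = df.groupby("month")["sd_mean"].mean()   # long-term average per calendar month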
Each iteration takes about 22 seconds. That seems long. But very weirdly, if I re-run the exact same code, each iteration only takes <10 ms.
What makes no difference:
- Whether I run this in a Jupyter Notebook or from a stand-alone script
- Whether I restart the Python 3 kernel
- Whether I use .sel(..., method='nearest') or provide the exact lon/lat coordinates
I'm a bit stumped here. Is something being cached at the OS level? How can I optimize this code, since I'm going to be doing a lot of exactly this? Should I use chunking?