7

I have a number of coordinates (roughly 20000) for which I need to extract data from a number of NetCDF files, each with roughly 30000 timesteps (future climate scenarios). Using the solution here is not efficient, and the reason is the time spent at each i,j to convert "dsloc" to a dataframe (see the code below). **An example NetCDF file can be downloaded from here.**

import pandas as pd
import xarray as xr
import time

# Generate some coordinates
coords_data = [{'lat': 68.04, 'lon': 15.20, 'stid': 1},
               {'lat': 67.96, 'lon': 14.95, 'stid': 2}]
crd = pd.DataFrame(coords_data)
lat = crd["lat"]
lon = crd["lon"]
stid = crd["stid"]

NC = xr.open_dataset(nc_file)   # nc_file: path to the NetCDF file
point_list = zip(lat, lon, stid)
temp = pd.DataFrame()           # accumulator for the per-station dataframes
start_time = time.time()
for i, j, id in point_list:
    print(i, j)
    dsloc = NC.sel(lat=i, lon=j, method='nearest')
    print("--- %s seconds ---" % (time.time() - start_time))
    DT = dsloc.to_dataframe()
    DT.insert(loc=0, column="station", value=id)
    DT.reset_index(inplace=True)
    temp = temp.append(DT, sort=True)
    print("--- %s seconds ---" % (time.time() - start_time))

which outputs:

68.04 15.2
--- 0.005853414535522461 seconds ---
--- 9.02660846710205 seconds ---
67.96 14.95
--- 9.028568267822266 seconds ---
--- 16.429600715637207 seconds ---

which means each i,j takes around 9 seconds to process. Given the large number of coordinates and NetCDF files with many timesteps, I wonder whether there is a pythonic way to optimize this code. I could also use CDO and NCO operators, but I found a similar issue using them too.

Seji
  • Great question, I can't work it out, but maybe you can try https://examples.dask.org/xarray.html. I came up with this idea because I know that for a dataframe "dask" is faster than "pandas", so I just googled "xarray dask", and it does exist. – Jeremy Sep 26 '21 at 00:15
  • Or maybe you can first divide your coordinates into groups, then use "map" in Python to run parallel jobs. – Jeremy Sep 26 '21 at 00:16
  • note that dask is not automatically faster than pandas or xarray; in fact, it is always slower (in CPU time) for in-memory problems. Dask is extremely useful for leveraging multiple cores or machines, but it's not a silver bullet for speeding things up. It sounds like the OP wants to apply this code to many datasets, so tuning this code without dask is needed so the full dask workflow (across many files) would be faster. – Michael Delgado Sep 26 '21 at 17:17
  • That said, I would definitely recommend using dask to process your many files, e.g. by wrapping this code into a function and then applying it across your data with `client.map`. Alternatively, you can read in all the files with `xr.open_mfdataset` and then use the indexing I propose below. – Michael Delgado Sep 26 '21 at 17:21
  • If this is something you need to do repeatedly, it may be worth chunking your files for time-series access using `nccopy`. See [Chunking Data: Why it Matters](https://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters). – Robert Davy Sep 26 '21 at 22:12
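
A minimal sketch of the dask workflow suggested in the comments above (wrapping the extraction in a function and mapping it over the files with `client.map`); the file pattern and the `extract_points` helper are placeholders for illustration, and the selection logic follows the DataArray indexing proposed in the answer below:

import glob
import pandas as pd
import xarray as xr
from dask.distributed import Client

def extract_points(path, crd):
    """Extract the nearest grid cell for every station in one NetCDF file."""
    with xr.open_dataset(path) as ds:
        crd_ix = crd.set_index("stid").to_xarray()
        sub = ds.sel(lon=crd_ix.lon, lat=crd_ix.lat, method="nearest")
        return sub.to_dataframe().reset_index()

client = Client()                            # local cluster, one worker per core
files = sorted(glob.glob("scenario_*.nc"))   # placeholder file pattern
futures = client.map(extract_points, files, crd=crd)
results = pd.concat(client.gather(futures))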

2 Answers

8

This is a perfect use case for xarray's advanced indexing using a DataArray index.

# Make the index on your coordinates DataFrame the station ID,
# then convert to a dataset.
# This results in a Dataset with two DataArrays, lat and lon, each
# of which is indexed by a single dimension, stid
crd_ix = crd.set_index('stid').to_xarray()

# now, select using the arrays, and the data will be re-oriented to have
# the data only for the desired pixels, indexed by 'stid'. The
# non-indexing coordinates lat and lon will be indexed by (stid) as well.
NC.sel(lon=crd_ix.lon, lat=crd_ix.lat, method='nearest')

Other dimensions in the data will be ignored, so if your original data has dimensions (lat, lon, z, time) your new data would have dimensions (stid, z, time).
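
For completeness, a minimal end-to-end sketch of this approach applied to the setup in the question (names follow the question; `nc_file` is the path to one scenario file, and the final `to_dataframe` call just reproduces the per-station dataframe the original loop was building):

import pandas as pd
import xarray as xr

# station coordinates from the question
crd = pd.DataFrame([{'lat': 68.04, 'lon': 15.20, 'stid': 1},
                    {'lat': 67.96, 'lon': 14.95, 'stid': 2}])

NC = xr.open_dataset(nc_file)

# vectorized nearest-neighbour selection for all stations at once
crd_ix = crd.set_index('stid').to_xarray()
dsloc = NC.sel(lon=crd_ix.lon, lat=crd_ix.lat, method='nearest')

# a single dataframe indexed by station and time, no Python-level loop
DT = dsloc.to_dataframe().reset_index()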

Michael Delgado
  • Thank you, the suggested method significantly reduced the time requirement. Though memory would still be an issue when opening the ncfiles, which I believe I could manage with "xr.open_mfdataset". – Seji Sep 26 '21 at 21:44
  • Yep - I'd definitely use open_mfdataset, which will use dask to leverage multiple cores and manage memory. – Michael Delgado Sep 26 '21 at 21:49
  • Thanks both, great learning! – Jeremy Sep 26 '21 at 21:50
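
A rough sketch of the `open_mfdataset` variant discussed in these comments (the file pattern and chunk sizes are placeholders; `crd_ix` is built as in the answer above):

import xarray as xr

# open all scenario files lazily as one dask-backed dataset
NC_all = xr.open_mfdataset('scenario_*.nc', combine='by_coords',
                           parallel=True, chunks={'time': 1000})

subset = NC_all.sel(lon=crd_ix.lon, lat=crd_ix.lat, method='nearest')

# data is only read from disk when the result is computed
DT = subset.compute().to_dataframe().reset_index()
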
0

I have a potential solution. The idea is to convert the xarray data to a pandas dataframe first, then take a subset of that dataframe based on lat/lon conditions.

import pandas as pd

# convert xarray data to a pandas dataframe
def xr_to_df(data):
    data = data.to_dataframe()
    data.reset_index(inplace=True)
    return data

# convert your xarray data to a pandas dataframe
full_df = xr_to_df(full_xarray)

# create a 2-column pandas dataframe containing your target coordinates
points = pd.DataFrame({'lat': target_lat, 'lon': target_lon})

# keep only the rows at your target points via an (inner) merge on lat/lon
subset = pd.merge(points, full_df, on=['lat', 'lon'])

I am not sure how fast this will be for your size of data. But at least it avoids looping, so I guess it should be faster?

I noticed that your points are randomly distributed (not on the grid centres). To address this, you can first write your own code to regrid them onto the NetCDF resolution, using something like np.argmin(abs(lat - lat_netcdf)) to find the nearest lat and lon, as in the sketch below.
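
A rough sketch of that snapping step (array names are illustrative; `target_lat`/`target_lon` are the station coordinates and `full_xarray` the opened dataset, as above):

import numpy as np
import pandas as pd

# coordinate axes of the NetCDF grid
lat_netcdf = full_xarray['lat'].values
lon_netcdf = full_xarray['lon'].values

# snap every target point to its nearest grid centre
snapped_lat = [lat_netcdf[np.argmin(abs(lat_netcdf - la))] for la in target_lat]
snapped_lon = [lon_netcdf[np.argmin(abs(lon_netcdf - lo))] for lo in target_lon]

points = pd.DataFrame({'lat': snapped_lat, 'lon': snapped_lon})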

Jeremy
  • The original question specifically says converting to pandas is not an acceptable answer. This approach will be very slow and memory inefficient. – Michael Delgado Sep 26 '21 at 17:15
  • I tested the answer by @Jeremy too; it did improve the timing, but converting the ncfiles to dataframes has two issues ... one is time, the other is memory ... if I split the ncfiles on a yearly basis, the memory issue would be solved, but it still takes roughly 8 minutes to convert each file ... so perhaps not the best solution. – Seji Sep 26 '21 at 21:42