
I am working with NASA NEX-GDDP-CMIP6 data. I currently have working code that opens and slices each file individually, but it takes days to download one variable for all model outputs and scenarios. My goal is to obtain all temperature and precipitation data for all model outputs and scenarios, then apply climate indicators and build an ensemble with xclim.
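For context, the downstream xclim step I have in mind looks roughly like this (a sketch, not the final pipeline; the file names are placeholders for the per-model point series saved by the code below):

import xarray as xr
from xclim import atmos
from xclim.ensembles import create_ensemble

# Open the per-model point time series saved earlier (placeholder file names)
members = [xr.open_dataset(p) for p in ['tasmax_UKESM1-0-LL.nc', 'tasmax_ACCESS-CM2.nc']]

# Stack the members along a new 'realization' dimension
ens = create_ensemble(members)

# Example indicator: annual count of days with tasmax above 25 degC
hot_days = atmos.tx_days_above(tasmax=ens.tasmax, thresh='25.0 degC', freq='YS')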

import xarray as xr

url = 'https://ds.nccs.nasa.gov/thredds2/dodsC/AMES/NEX/GDDP-CMIP6/UKESM1-0-LL/ssp585/r1i1p1f2/tasmax/tasmax_day_UKESM1-0-LL_ssp585_r1i1p1f2_gn_2098.nc'
lat = 53
lon = 0

try:
    # Open the remote file over OPeNDAP, interpolate to the point, and save locally
    with xr.open_dataset(url) as ds:
        ds.interp(lat=lat, lon=lon).to_netcdf(url.split('/')[-1])
except Exception as e:
    print(e)

This code works but is very slow (days for one variable at a single location). Is there a better, faster way? I'd rather not download the whole files, as each is 240 MB!

Update:

I have also tried the following to take advantage of dask parallel tasks. It is slightly faster, but still takes on the order of days to complete for a full variable output:

import xarray as xr

def interp_one_url(path, lat, lon):
    # Open one remote file over OPeNDAP and interpolate to the target point
    with xr.open_dataset(path) as ds:
        return ds.interp(lat=lat, lon=lon)

urls = ['https://ds.nccs.nasa.gov/thredds2/dodsC/AMES/NEX/GDDP-CMIP6/UKESM1-0-LL/ssp585/r1i1p1f2/tasmax/tasmax_day_UKESM1-0-LL_ssp585_r1i1p1f2_gn_2100.nc',
        'https://ds.nccs.nasa.gov/thredds2/dodsC/AMES/NEX/GDDP-CMIP6/UKESM1-0-LL/ssp585/r1i1p1f2/tasmax/tasmax_day_UKESM1-0-LL_ssp585_r1i1p1f2_gn_2099.nc']
lat = 53
lon = 0

# Save each interpolated dataset under its original file name
paths = [url.split('/')[-1] for url in urls]
datasets = [interp_one_url(url, lat, lon) for url in urls]
xr.save_mfdataset(datasets, paths=paths)
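A sketch of how the loop itself could be parallelized with dask.delayed (reusing interp_one_url, urls, and paths from above; this runs the requests concurrently, but each task still pulls its data from the server):

import dask

# One delayed task per remote file, executed concurrently by dask
tasks = [dask.delayed(interp_one_url)(url, lat, lon) for url in urls]
datasets = list(dask.compute(*tasks))
xr.save_mfdataset(datasets, paths=paths)
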
  • Hard to answer without full code. Are you only using one lat/lon? My guess is that `ds.sel(lat=lat, lon=lon, method='nearest')` is loading the entire global dataset into memory, so there is no difference. I can't imagine there is any other way for xarray to do nearest neighbour. Instead, you should specify the precise coordinate or cell indices needed – Robert Wilson Jan 20 '22 at 08:58
  • @RobertWilson this is pretty much the full code that would run, I've omitted specific long/lat, but any could be plugged in. I've tried comparing speed with NumPy finding the nearest grid cell instead, and there are no real differences in speed to method='nearest'. I don't think I can avoid loading/downloading the whole dataset from the server into memory, but I'm not really sure – Clim Sci Jan 20 '22 at 14:50
  • xarray or NCO on the command line should be able to download only a single location. If you want a full answer it is probably best to give a minimal reproducible example. Most of your code is irrelevant. You just need 3 lines of code to show your problem. – Robert Wilson Jan 20 '22 at 15:03
  • @RobertWilson I've updated the code so it will run fully as is. If you have a method that works for downloading a single location without having to bring the whole file into memory I'd appreciate it. I've read the xarray docs I/O section and didn't seem to find a solution for that. – Clim Sci Jan 20 '22 at 15:12
  • Your code is still mostly irrelevant. The looping has nothing whatsoever to do with your problem. If you want an answer to the question you should minimize the amount of time required to understand the problem – Robert Wilson Jan 20 '22 at 15:28
  • @RobertWilson hopefully more succinct and clear now – Clim Sci Jan 24 '22 at 16:01
  • As I said, it is best to provide a minimal reproducible example. This is neither minimal nor reproducible. All netcdf files are different, as the coordinates are stored in totally different ways, so the problem has only specific solutions. You need to provide one url, the lon/lat you want and the interp code. 3-4 lines maximum – Robert Wilson Jan 24 '22 at 16:36
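Following the suggestion in the comments, a minimal sketch of label-based nearest-neighbour selection over OPeNDAP (using the same url, lat, and lon as above; whether this avoids transferring the full grid depends on how the server and backend handle the request):

import xarray as xr

# Select the single nearest grid cell by label instead of interpolating
with xr.open_dataset(url) as ds:
    ds.sel(lat=lat, lon=lon, method='nearest').to_netcdf('tasmax_point.nc')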

1 Answer


One way is to download via the NetCDF Subset Service (NCSS) portal that NASA provides, instead of OPeNDAP. The URL pattern is different, but it can be iterated over in the same way.

e.g.

import wget

lat = 53
lon = 0

URL = "https://ds.nccs.nasa.gov/thredds/ncss/AMES/NEX/GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/pr/pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_2014.nc?var=pr&north={}&west={}&east={}&south={}&disableProjSubset=on&horizStride=1&time_start=2014-01-01T12%3A00%3A00Z&time_end=2014-12-31T12%3A00%3A00Z&timeStride=1&addLatLon=true"

# Fill in the bounding box: north, west, east, south
wget.download(URL.format(lat, lon, lon + 1, lat - 1))

This accomplishes the subsetting and the download in one step. Once you have the URL pattern, you can use something like wget and run the downloads in parallel, which is much faster than selecting and saving one file at a time.
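For example, a sketch of running several NCSS downloads concurrently with a thread pool (ncss_urls is a hypothetical list of pre-formatted URLs, one per year/model, built from the URL template above):

from concurrent.futures import ThreadPoolExecutor
import wget

# Hypothetical list of pre-formatted NCSS URLs; extend with more years/models
ncss_urls = [URL.format(lat, lon, lon + 1, lat - 1)]

# Downloads are I/O-bound, so a thread pool speeds them up
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(wget.download, ncss_urls))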

torchern