Easy, scriptable way to sub-sample unstructured THREDDS data?

Question

I'm trying to get a subset of data from a triangular mesh model that is being served by THREDDS. I'd like to be able to specify a LAT/LON bounding box and just get the data from within that box. The Data URL is:

http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_MET_FORECAST.nc

With gridded data it is pretty easy to subset the data from a THREDDS server. Does anyone know what the best way is to get a sub-domain of a triangular mesh that is served by THREDDS?

For gridded data I use Ferret as my OPeNDAP client and I'm able to script the download process. I'd like to do something similar here although I could use Matlab, Python or other tools.

Thanks,

Steve

score 5 · Accepted Answer · edited Jun 15 '16 at 14:52

Opendap and NetCDF don't allow extraction using an irregular indexing. You can only request a start, stop and stride.

And because this is a triangular grid, there is no guarantee that nodes of the triangles in the same region have similar indices. So if you want to get only those nodes within a bounding box, you would have to request them one-by-one. And that is slow. So in many cases it's faster to determine the minimum and maximum indices and request that whole chunk in one piece, then extract the indices as needed.

Here's a sample comparison of the two approaches in python. In this example, extracting the subset which includes all the indices is about 10 times faster than looping over each index, extracting time series:

import netCDF4
import time
import numpy as np

url='http://www.smast.umassd.edu:8080/thredds/dodsC/FVCOM/NECOFS/Forecasts/NECOFS_GOM3_FORECAST.nc'
nc = netCDF4.Dataset(url)
ncv = nc.variables
lon = ncv['lon'][:]
lat = ncv['lat'][:]
tim = ncv['time'][:]

# find indices inside box
box = [-71.4,41,-70.2,41.5]
ii = (lon>=box[0])&(lon<=box[2])&(lat>=box[1])&(lat<=box[3])
# jj will have just indices from inside the box:
jj = np.where(ii)[0]

If we loop over each index, it's slow:

# loop over indices, extracting time series
time0=time.time()
zi = np.zeros((len(tim),len(jj)))
k=0
for j in jj:
    zi[:,k]=ncv['zeta'][:,j]
    k +=1
print('elapsed time: %d seconds' % (time.time()-time0))

elapsed time: 56 seconds

But if we loop over the range at each time step, it's much faster:

time0=time.time()
zi2 = np.zeros((len(tim),len(jj)))
jmin=jj.min()
jmax=jj.max()

for i in range(len(tim)):
    ztmp = ncv['zeta'][i,jmin:jmax+1]
    zi2[i,:] = ztmp[jj-jmin]
print('elapsed time: %d seconds' % (time.time()-time0))

elapsed time: 6 seconds

Of course, you results may vary depending on the size of the unstructured grid, the proximity of the points in your subset, the number of points you are extracting, etc. But hopefully this gives you an idea of the issues involved.

Thanks Rich. This is very helpful. I'll try to come up with something based on this. The goal is to have a NetCDF file with the subset. In Ferret it was as simple as: `save/file=myfile.nc/i=55:94/j=204:253 TEMP,SALINITY` to create a small file of just Temperature and Salinity in the area I am interested in. I'll work with what you provided and try to add code to save to NetCDF. — Steve Cousins, Nov 07 '14 at 22:03

acrosby · Answer 2 · 2014-11-07T17:58:59.413

Rich's answer provides the best way to do this if you have really tight performance constraints. But, as he says, results may vary by mesh and subset region.

If you are only interested in 1 timestep at a time (no pun intended), there is marginal cost and much easier code to just grab the whole spatial domain as-is from the dap server:

time0 = time.time()
zi3 = ncv['zeta'][0, :]
zi3 = zi3[jj]
print('elapsed time: %d seconds' % (time.time() - time0))

elapsed time: 0 seconds

When profiled (even when pre-allocating the arrays), the coarse min/max subset that Rich does is about 2x as fast though, given this mesh and specific spatial subset. The NECOFS mesh is relative small in the world of unstructured meshes. You might run into problems with Atlantic scale ADCIRC meshes for instance.

POSTSCRIPT: You are grabbing the whole grid when you get lat and lon anyway to determine your subsetting indices.

Easy, scriptable way to sub-sample unstructured THREDDS data?

2 Answers2