I have a netCDF4.Variable object:

<class 'netCDF4._netCDF4.Variable'>
int16 myvar(time, latitude, longitude)
    standard_name: my_var
    long_name: Something
    units: (0 - 1)
    add_offset: 0.499999843485
    scale_factor: 1.54488841466e-05
    _FillValue: -32767
    missing_value: -32767
unlimited dimensions: time
current shape = (13148, 1441, 2880)
filling off

This is a 3D variable: the first dimension is temporal and the other two are spatial.

I would like to access a subset of this variable containing:

  • A subset of the temporal range (e.g. from 7000 to 8000).
  • A subset of the points, identified by indices into the flattened spatial dimensions. In the example above, indices range between 0 and 1441 * 2880.

Basically, I have:

tmin = 7000
tmax = 8000
upts = [42829, 9289, 3242]

My current way of accessing this is:

z = np.zeros((len(upts), tmax - tmin))
for i in range(tmin, tmax):
    z[:, i - tmin] = my_var[i, :, :].flatten()[upts]

I was wondering if there was a faster way to do this?

I cannot load the whole dataset in memory because it is already huge, and could be larger.

I also cannot work with a single i at a time, because I want to operate on rows of z (which correspond to "columns" in my_var).
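
Not part of the original post, but one way to keep memory bounded while cutting down the per-step overhead is to read the time axis in blocks rather than one step at a time: each contiguous slab read is a single netCDF call, and the flat spatial indexing is then plain NumPy fancy indexing on the in-memory slab. This is a sketch; the function name `extract_points` and the `block` size are illustrative assumptions, and `var` can be the netCDF4 Variable itself or any array-like sliceable along its first axis:

```python
import numpy as np

def extract_points(var, upts, tmin, tmax, block=100):
    # Read var[tmin:tmax] in slabs of `block` timesteps, so at most one
    # (block, nlat, nlon) slab is resident at a time, then pick the
    # flattened spatial indices `upts` from each slab.
    upts = np.asarray(upts)
    pieces = []
    for t0 in range(tmin, tmax, block):
        t1 = min(t0 + block, tmax)
        slab = var[t0:t1]                       # one contiguous read
        flat = slab.reshape(t1 - t0, -1)        # (time, nlat * nlon)
        pieces.append(flat[:, upts].T)          # (len(upts), t1 - t0)
    return np.concatenate(pieces, axis=1)       # (len(upts), tmax - tmin)
```

Tuning `block` trades memory for fewer reads; with auto-scaling on (the variable has scale_factor/add_offset), the slabs come back as float arrays, and np.concatenate preserves that dtype.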

Holt
  • I haven't used `netCDF4` in ages, but can't you just use `my_var[tmin:tmax, :, upts]`? That's how I'd do it with `numpy` arrays, and `h5py`. – hpaulj Jan 30 '18 at 17:28
  • @hpaulj I am sorry I completely forgot to specify that `upts` contains index of the flattened array "spatial" part of the array. In this example, `upts` values could range from `0` to `1441 * 2880`. – Holt Jan 30 '18 at 17:32
  • DO you have space to load `my_var[tmin:tmax,:,:]` and then do the `upts` indexing on the array? Is `upts` typically small compared to the time range? If you must iterate, do so on the smallest range. – hpaulj Jan 30 '18 at 17:41
  • @hpaulj I won't have enough space most of the time, but if you have an idea for when `tmax - tmin` is small, I'll be happy to hear it. `upts` is typically large compared to `tmax - tmin`, but not orders of magnitude larger. Even when `upts` is close to `tmax - tmin` (or smaller), iterating over the first dimension is much faster due, I think, to the way data are stored in netCDF (row major). – Holt Jan 30 '18 at 20:04
  • Is the data chunked? If so, the best way to access it would depend on the chunking. For a large netCDF file, chunking is usually a good idea. – AF7 Jan 31 '18 at 08:00
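
Following up on the chunking comment: if the file is chunked (the real chunk shape would come from `my_var.chunking()`, which returns `'contiguous'` or a list of per-dimension chunk sizes), aligning block reads to chunk boundaries avoids decompressing the same chunk twice. A minimal sketch; the helper name and the rounding idea are assumptions, not from the thread:

```python
def align_to_chunks(tmin, tmax, t_chunk):
    # Round a [tmin, tmax) time range outward to multiples of the
    # on-disk time-chunk size, so each chunk is read exactly once.
    start = (tmin // t_chunk) * t_chunk
    stop = -(-tmax // t_chunk) * t_chunk  # ceiling division
    return start, stop
```

Reading `var[start:stop]` block by block with `block = t_chunk` (or a multiple of it) then touches every chunk at most once.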

2 Answers

1

Not sure if this works memory-wise, but would you consider cutting down the file first from the command line outside python using NCO or CDO and then reading that from python? It depends on whether you need to repeatedly access different chunks of the file, or whether this is a one-off access.

The command to hyperslab in NCO is

ncks -d dim_name,val1,val2 in.nc -O out.nc 

If val is an integer, the cut is done by index; if it is a float, it is done on the value of the dimension coordinate. You can put several dimensions in the argument list (as given in this answer: NCO cropping a netcdf file using dimension values rather than indices).

CDO needs standard dimensions, lat, lon and time, that you seem to have, so you can cut a lat-lon box and a range of timesteps using

cdo seltimestep,7000,8000 -sellonlatbox,lon1,lon2,lat1,lat2 in.nc out.nc

CDO has the advantage that you can also cut the times using the date format. So you could cut from 7000 to 8000 using

cdo seldate,yyyymmdd,yyyymmdd in.nc out.nc

which I find very useful.

In my experience, CDO seems to choke on memory requirements when NCO works fine, so that may influence your choice. In any case, if you need to repeatedly pick out different sections of the file, this suggestion is not very useful for you, but hopefully might be useful for others.
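
If the cut still needs to be driven from Python, the ncks call above can be wrapped in a small helper. This is a sketch, not from the answer; the function name is hypothetical and the file names are placeholders (executing it requires NCO to be installed):

```python
# Assemble the ncks hyperslab command from the answer above.
# Integer lo/hi cut by index; floats cut by coordinate value.
def ncks_hyperslab_cmd(dim, lo, hi, infile, outfile):
    return ["ncks", "-d", f"{dim},{lo},{hi}", infile, "-O", outfile]

# To run it:
# import subprocess
# subprocess.run(ncks_hyperslab_cmd("time", 7000, 8000, "in.nc", "out.nc"),
#                check=True)
```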

ClimateUnboxed
0

If you want to access the data for read only, then you don’t have to copy all the elements in your subset to a new list. Instead, you can just copy the references to the lists:

z = [my_var[i] for i in range(tmin, tmax)]

If you need a copy of the subset that you can edit without affecting the whole set, then you do have to make copies of the elements like your current code does.

JasonR
  • I want to access data for read-only, but I need to access them column-wise in `z`, so I cannot do it your way (unless I iterate over the list, but this would be slower). – Holt Jan 30 '18 at 17:08
  • @Holt I’m a little confused, because the code to copy the list in your question copies them with temporal as the first dimension. In your code, and in mine, the second spatial dimension is the one that is stored contiguously. Which dimension do you want to have stored contiguously? – JasonR Jan 30 '18 at 17:38
  • Yes, the first dimension is time in your example as in mine, but I actually transpose `z` just after the loop because I want the spatial dimension stored contiguously. I already tried to read one "column" at a time and work with it directly, but this is way slower since netCDF (I think) stores data in row-major format. I will update my question to make this part clearer. – Holt Jan 30 '18 at 20:00