
I have a very large netCDF file that I am reading using netCDF4 in Python.

I cannot read this file all at once since its dimensions (1200 x 720 x 1440) are too big for the entire file to fit in memory. The first dimension represents time, and the next two represent latitude and longitude, respectively.

import netCDF4
nc_file = netCDF4.Dataset(path_file, 'r', format='NETCDF4')
# Read one year (one 720 x 1440 time slice) at a time
for yr in years:
    data = nc_file.variables[variable_name][int(yr), :, :]

However, reading one year at a time is excruciatingly slow. How do I speed this up for the use cases below?

--EDIT

The chunk size in the time dimension is 1.

  1. I can read a range of years: nc_file.variables[variable_name][0:100, :, :]

  2. There are several use-cases:

# Sum each year
for yr in years:
    numpy.ma.sum(nc_file.variables[variable_name][int(yr), :, :])

# Multiply each year by a 2D array of shape (720 x 1440)
for yr in years:
    numpy.ma.sum(nc_file.variables[variable_name][int(yr), :, :] * arr_2d)

# Add 2 netcdf files together 
for yr in years:
    numpy.ma.sum(nc_file.variables[variable_name][int(yr), :, :] + 
                 nc_file2.variables[variable_name][int(yr), :, :])
  • Are you sure reading it in any other manner (e.g. the entire file at once) would be any faster? Can you try with a cropped file? – ivan_pozdeev Feb 16 '16 at 05:02
  • Any essential profiling (http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) done? – ivan_pozdeev Feb 16 '16 at 05:04
  • Are you doing anything with the year's data once you read it? Can you read a range of years, e.g. `[1997:2007,:,:]`? – hpaulj Feb 16 '16 at 09:17
  • Thanks @hpaulj, I can read a range of years. There are several use-cases. I edited the question to reflect them. – user308827 Feb 18 '16 at 02:05

3 Answers

34

I highly recommend that you take a look at the xarray and dask projects. These powerful tools let you easily split the computation into chunks. This brings two advantages: you can compute on data which does not fit in memory, and you can use all of the cores of your machine for better performance. You can optimize performance by choosing the chunk size appropriately (see the documentation).

You can load your data from netCDF by doing something as simple as

import xarray as xr
ds = xr.open_dataset(path_file)

If you want to chunk your data in years along the time dimension, then you specify the chunks parameter (assuming that the year coordinate is named 'year'):

ds = xr.open_dataset(path_file, chunks={'year': 10})

Since the other coordinates do not appear in the chunks dict, a single chunk will be used for them (see more details in the documentation). This will be useful for your requirement of multiplying each year by a 2D array. You would simply do:

ds['new_var'] = ds['var_name'] * arr_2d

Now, xarray and dask are computing your result lazily. In order to trigger the actual computation, you can simply ask xarray to save your result back to netCDF:

ds.to_netcdf(new_file)

The computation is triggered through dask, which takes care of splitting the processing into chunks and thus enables working with data that does not fit in memory. In addition, dask will use all your processor cores to compute the chunks.
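
For instance, the per-year sum from the question can be expressed lazily in the same way and then computed in one go; a minimal sketch, assuming the time dimension is named 'year' and that the variable and spatial dimensions are called 'var_name', 'latitude' and 'longitude' (adjust to the actual file):

import xarray as xr

# Open lazily; dask splits the array into chunks of 10 time steps each.
ds = xr.open_dataset(path_file, chunks={'year': 10})

# Build the computation lazily; nothing is read from disk yet.
yearly_sums = ds['var_name'].sum(dim=['latitude', 'longitude'])

# Trigger the chunked (and parallel) computation.
result = yearly_sums.compute()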

The xarray and dask projects still don't deal gracefully with situations where chunks do not "align" well for parallel computation. Since in this case we chunked only along the 'year' dimension, we expect no issues.

If you want to add two different netCDF files together, it is as simple as:

ds1 = xr.open_dataset(path_file1, chunks={'year': 10})
ds2 = xr.open_dataset(path_file2, chunks={'year': 10})
(ds1 + ds2).to_netcdf(new_file)

I have provided a fully working example using a dataset available online.

In [1]:

import xarray as xr
import numpy as np

# Load sample data and strip out most of it:
ds = xr.open_dataset('ECMWF_ERA-40_subset.nc', chunks = {'time': 4})
ds.attrs = {}
ds = ds[['latitude', 'longitude', 'time', 'tcw']]
ds

Out[1]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
    tcw        (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...

In [2]:

arr2d = np.ones((73, 144)) * 3.
arr2d.shape

Out[2]:

(73, 144)

In [3]:

myds = ds
myds['new_var'] = ds['tcw'] * arr2d

In [4]:

myds

Out[4]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
    tcw        (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...
    new_var    (time, latitude, longitude) float64 30.46 30.46 30.46 30.46 ...

In [5]:

myds.to_netcdf('myds.nc')
xr.open_dataset('myds.nc')

Out[5]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
Data variables:
    tcw        (time, latitude, longitude) float64 10.15 10.15 10.15 10.15 ...
    new_var    (time, latitude, longitude) float64 30.46 30.46 30.46 30.46 ...

In [6]:

(myds + myds).to_netcdf('myds2.nc')
xr.open_dataset('myds2.nc')

Out[6]:

<xarray.Dataset>
Dimensions:    (latitude: 73, longitude: 144, time: 62)
Coordinates:
  * time       (time) datetime64[ns] 2002-07-01T12:00:00 2002-07-01T18:00:00 ...
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
Data variables:
    tcw        (time, latitude, longitude) float64 20.31 20.31 20.31 20.31 ...
    new_var    (time, latitude, longitude) float64 60.92 60.92 60.92 60.92 ...
Pedro M Duarte
2

Check the chunking of the file. ncdump -s <infile> will give the answer. If the chunk size in the time dimension is larger than one, you should read that many years at once; otherwise you are reading several years from disk at a time while using only one of them. How slow is slow? A few seconds per time step at most sounds reasonable for an array of this size. Telling us more about what you do with the data afterwards may give us more guidance on where the problem lies.
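
The chunking can also be inspected from Python, and the loop from the question can be rewritten to read a block of years per disk access; a rough sketch, where the block size of 100 is only a guess to be tuned against memory and the file's chunk layout:

import netCDF4
import numpy

nc_file = netCDF4.Dataset(path_file, 'r')
var = nc_file.variables[variable_name]

# chunking() returns 'contiguous' or a list of chunk sizes per dimension,
# e.g. [1, 720, 1440] for this file.
print(var.chunking())

block = 100
for start in range(0, var.shape[0], block):
    data = var[start:start + block, :, :]    # one disk read per block
    for i in range(data.shape[0]):
        total = numpy.ma.sum(data[i])        # per-year sum, as in the question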

kakk11
  • Is that 1 in the time dimension, and what is it in the other dimensions? Can you also clarify how slow "slow" is in your case? – kakk11 Feb 18 '16 at 08:41
  • The chunk size is 720 and 1440 for the other dimensions. It takes a fraction of a second for each iteration of the loop, but it adds up when you have to iterate over 1200 years. – user308827 Feb 18 '16 at 16:33
  • Then you may already be at the maximum speed for the current file and hardware. If you have the option to rewrite the data, you may try PyTables and convert the files to blosc-compressed HDF5. This should be faster than zlib-compressed NetCDF4, though the file will be slightly bigger. As rewriting the file was not an option in your question, I'll not add it to the answer yet, but as I've recently converted NetCDF to PyTables, I could give you some hints. – kakk11 Feb 18 '16 at 17:54
  • Thanks @kakk11, how slow/fast is the rewriting option? I.e., does it take so long to rewrite netCDF into HDF5 that the subsequent speed benefits are useless? – user308827 Feb 18 '16 at 19:01
  • It's difficult to estimate the time before trying, maybe 15-30 minutes? But you do it only once per file; all later analysis can be done on the HDF5 file and will run faster. You could also rechunk the data during conversion, which would make reading spatial subsets faster. So it really depends on how many times you plan to read the file. You could also try to parallelise the reading, but again, depending on I/O speed, it may not justify the additional coding effort. – kakk11 Feb 18 '16 at 19:30
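
A minimal, untested sketch of the conversion described in these comments, using PyTables with blosc compression (the output path, node name and block size are made up):

import numpy
import netCDF4
import tables

nc_file = netCDF4.Dataset(path_file, 'r')
var = nc_file.variables[variable_name]

# Copy the variable in blocks into a blosc-compressed, extendable HDF5 array,
# so the whole dataset never sits in memory at once.
filters = tables.Filters(complevel=5, complib='blosc')
with tables.open_file('converted.h5', mode='w', filters=filters) as h5:
    out = h5.create_earray(h5.root, 'data',
                           atom=tables.Float64Atom(),
                           shape=(0,) + var.shape[1:],
                           expectedrows=var.shape[0])
    block = 100
    for start in range(0, var.shape[0], block):
        # Fill masked values with NaN (assuming float data) before writing.
        out.append(numpy.ma.filled(var[start:start + block, :, :], numpy.nan))

Later analysis can then slice the 'data' node directly, decompressing only the blocks it touches.
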
0

This is kinda hacky, but it may be the simplest solution:

Read subsets of the file into memory, then pickle (https://docs.python.org/3/library/pickle.html) them back to disk for future use. Loading your data from a pickled data structure is likely to be faster than parsing the netCDF file every time.
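
For example, a slab of years might be cached like this (the path and slice bounds are purely illustrative):

import pickle
import netCDF4

nc_file = netCDF4.Dataset(path_file, 'r')
subset = nc_file.variables[variable_name][0:100, :, :]   # read once from netCDF

# Cache the slab; later runs can load it without touching the netCDF file.
with open('subset_0_100.pkl', 'wb') as f:
    pickle.dump(subset, f, protocol=pickle.HIGHEST_PROTOCOL)

with open('subset_0_100.pkl', 'rb') as f:
    subset = pickle.load(f)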

knite
  • It is quite likely that writing/reading HDF5 with blosc compression, as in PyTables, is actually faster than cPickle. Not to mention the file size, which can get very big for uncompressed numeric data! – kakk11 Feb 23 '16 at 16:01