Get hourly average for each month from a netcdf file

Question

I have a netCDF file with the time dimension containing data by the hour for 2 years. I want to average it to get an hourly average for each hour of the day for each month. I tried this:

import xarray as xr
ds = xr.open_mfdataset('ecmwf_usa_2015.nc')    
ds.groupby(['time.month', 'time.hour']).mean('time')

but I get this error:

*** TypeError: `group` must be an xarray.DataArray or the name of an xarray variable or dimension

How can I fix this? If I do this:

ds.groupby('time.month', 'time.hour').mean('time')

I do not get an error but the result has a time dimension of 12 (one value for each month), whereas I want an hourly average for each month i.e. 24 values for each of 12 months. Data is available here: https://www.dropbox.com/s/yqgg80wn8bjdksy/ecmwf_usa_2015.nc?dl=0

I believe `ds` is an [xarray.Dataset](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html#xarray.Dataset) and not a [netCDF4.Dataset](https://unidata.github.io/netcdf4-python/#netCDF4.Dataset), is that correct? — SiggyF, Apr 03 '18 at 07:22
please provide some sample data, and clarify what should happen with hours where there is no data. If missing data should be taken into account, a `resample` is needed too — Maarten Fabré, Apr 03 '18 at 09:18
@SiggyF, you are right that ds is a xarray.Dataset that was produced by reading in a netCDF file — user308827, Apr 03 '18 at 17:24
@MaartenFabré, I will try and get a sample dataset (the full dataset is several GBs in size). You can assume that there is no missing data — user308827, Apr 03 '18 at 17:25
A minimal example with dummy (e.g. random) data usually works best. Although focused on Pandas, this question/answers might help with that: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — Bart, Apr 03 '18 at 18:00
data is available here: https://www.dropbox.com/s/yqgg80wn8bjdksy/ecmwf_usa_2015.nc?dl=0 — user308827, Apr 03 '18 at 21:52
You want hourly average for each month . ie. `24*30` approx for each month right? you mentioned 24 values for each of 12 months i.e `24*12` — Morse, Apr 05 '18 at 16:46
@Prateek, I want 24 values for each of 12 month i.e. `24 * 12` — user308827, Apr 05 '18 at 18:22
@user308827 wouldnt that be 24*30 for each month and 24*30*12 approx. for whole year? just having doubt — Morse, Apr 05 '18 at 18:28
xarray can't seem to group by 2 variables at the same time, so it might not work for this case — Maarten Fabré, Apr 05 '18 at 19:30
@Prateek, sorry I should be clearer. I want to create hourly data for an average day for each month which is why it should be `24 * 12` — user308827, Apr 05 '18 at 21:02
@user308827 that would be `24x(no.of days in month)x12` total = `24x365` records. But you want monthly average based on hour.. — Morse, Apr 05 '18 at 22:52
@Prateek, I want an average day for each month. So for each month I want to find the average value at 1 AM, 2 AM... This makes it 24 values for that month. — user308827, Apr 05 '18 at 23:06

score 6 · Answer 1 · answered Apr 09 '18 at 04:38

You are getting TypeError: group must be an xarray.DataArray or the name of an xarray variable or dimension because ds.groupby() expects a single xarray.DataArray or the name of one variable or dimension, and you passed a list of names.

You have two options:

1. xarray bins --> group by hour

Refer to the xarray group by documentation: split the dataset into bins (for example, one per month) and then apply groupby('time.hour') to each bin.

This is needed because applying groupby on month and then on hour, whether one after the other or together, aggregates over all the data. If you first split the data by month, you can then apply the group-by-hour mean to each month separately.

You can try this approach as mentioned in documentation:

GroupBy: split-apply-combine

xarray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:

Split your data into multiple independent groups. => Split the data by month using groupby_bins.

Apply some function to each group. => Apply groupby('time.hour') to each month.

Combine your groups back into a single data object. => Apply the aggregate function mean('time').
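The three steps above can be sketched as an explicit loop. This is only a minimal illustration on synthetic data, not the real ECMWF file: the variable name t2m and the value pattern (100 * month + hour) are assumptions chosen so the result is easy to check, and it splits with groupby('time.month') rather than groupby_bins.

```python
import pandas as pd
import xarray as xr

# Synthetic stand-in for the real file: one year of hourly values at one point.
# The variable name "t2m" and the values are assumptions for illustration.
time = pd.date_range("2015-01-01", "2015-12-31 23:00", freq="h")
values = time.month.values * 100.0 + time.hour.values
ds = xr.Dataset({"t2m": ("time", values)}, coords={"time": time})

per_month = []
months = []
# Split: one sub-dataset per calendar month.
for month, sub in ds.groupby("time.month"):
    # Apply: within this month, average over all days for each hour of the day.
    per_month.append(sub.groupby("time.hour").mean("time"))
    months.append(int(month))

# Combine: stack the monthly results along a new "month" dimension.
result = xr.concat(per_month, dim=pd.Index(months, name="month"))
print(result["t2m"].shape)
```

With this layout, result["t2m"].sel(month=3, hour=5) should be the mean of all 05:00 values in March.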

2. convert it into pandas dataframe and use group by

Warning: Not all netCDF files can be converted to a pandas DataFrame, and metadata may be lost in the conversion.

Convert ds into a pandas DataFrame with df = ds.to_dataframe() and group as required using pandas.Grouper, like

df.set_index('time').groupby([pd.Grouper(freq='1M'), 't2m']).mean()

Note: I saw a couple of answers using pandas.TimeGrouper, but it is deprecated; use pandas.Grouper instead.

Since your dataset is too big, the question does not include a minimal sample, and working on the full data consumes heavy resources, I would suggest looking at these examples on pandas
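As a rough sketch of the pandas route, the month and the hour of day can be derived from the time column and used as two grouping keys, which gives 24 values per month for each grid point. The dataset below is synthetic; the variable name t2m, the latitude values, and the value pattern are assumptions, and for a file of several GB the conversion to a DataFrame may not fit in memory.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: one year of hourly data at two latitudes.
# "t2m" and the coordinate values are assumptions for illustration.
time = pd.date_range("2015-01-01", "2015-12-31 23:00", freq="h")
hours = time.hour.values.astype(float)
ds = xr.Dataset(
    {"t2m": (("time", "latitude"), np.repeat(hours[:, None], 2, axis=1))},
    coords={"time": time, "latitude": [48.0, 47.75]},
)

# to_dataframe() puts the dimensions in a MultiIndex;
# reset_index() turns them back into ordinary columns.
df = ds.to_dataframe().reset_index()

# Group by calendar month, hour of day, and grid point, then average.
hourly_by_month = df.groupby(
    [
        df["time"].dt.month.rename("month"),
        df["time"].dt.hour.rename("hour"),
        "latitude",
    ]
)["t2m"].mean()

print(hourly_by_month.head())
```

The result is a Series indexed by (month, hour, latitude), so a single value can be looked up with hourly_by_month.loc[(7, 13, 48.0)].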

score 5 · Answer 2 · answered Jan 29 '19 at 14:43

In case you didn't solve the problem yet, you can do it this way:

# define a function with the hourly calculation:
def hour_mean(x):
    return x.groupby('time.hour').mean('time')

# group by month, then apply the function:
ds.groupby('time.month').apply(hour_mean)

This is the same strategy as the one in the first option given by @Prateek and based on the documentation, but the documentation was not that clear for me, so I hope this helps clarify. You can't apply a groupby operation to a groupby object so you have to build it into a function and use .apply() for it to work.
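For anyone who wants to check the shape of the result without the original file, here is a small self-contained sketch of the same nested groupby on synthetic data. The variable name t2m and the values are assumptions. It uses .map(), which as far as I know is the current name for .apply() on groupby objects in recent xarray releases; on older versions .apply() should behave the same way.

```python
import pandas as pd
import xarray as xr

# Two years of hourly synthetic data at a single point.
# "t2m" and the value pattern (100 * month + hour) are assumptions.
time = pd.date_range("2015-01-01", "2016-12-31 23:00", freq="h")
values = time.month.values * 100.0 + time.hour.values
ds = xr.Dataset({"t2m": ("time", values)}, coords={"time": time})


def hour_mean(x):
    # Average over all time steps that share the same hour of day.
    return x.groupby("time.hour").mean("time")


# Group by calendar month, then compute the hourly means inside each month.
monthly_hourly = ds.groupby("time.month").map(hour_mean)
print(monthly_hourly.sizes)
```

The result should have a month dimension of length 12 and an hour dimension of length 24, i.e. one average day per calendar month, pooled over both years.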

score 1 · Answer 3 · answered Oct 31 '19 at 17:36

Another way to do multitemporal grouping on a netCDF file with xarray is to combine the xarray DataArray method "resample" with the "groupby" method. This approach also works for xarray Dataset objects.

Through this approach, one can retrieve values like monthly-hourly means, or other kinds of temporal aggregation (e.g. annual monthly mean, bi-annual three-monthly sum, etc.).

The example below uses the standard xarray tutorial dataset of daily air temperature (Tair). Notice that I had to convert the time dimension of the tutorial data into a pandas datetime object. If this conversion were not applied, the resampling function would fail, and an error message would appear (see below):

Error message:

"TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'"

Despite that time-index problem (which could be a separate question on Stack Overflow), the code below presents two possible solutions for the multitemporal grouping problem in xarray objects. The first uses the xarray.core.groupby.DataArrayGroupBy class, while the second only uses the groupby method from the normal xarray DataArray and Dataset classes.

Sincerely yours,

Philipe Riskalla Leal

Code snippet:

import pandas as pd
import xarray as xr

ds = xr.tutorial.open_dataset('rasm').load()

def parse_datetime(time):
    return pd.to_datetime([str(x) for x in time])

ds.coords['time'] = parse_datetime(ds.coords['time'].values)


# 1° Option for multitemporal aggregation:


time_grouper = pd.Grouper(freq='Y')

grouped = xr.core.groupby.DataArrayGroupBy(ds, 'time', grouper=time_grouper)

for idx, sub_da in grouped:
    print(sub_da.resample({'time':'3M'}).mean().coords)


# 2° Option for multitemporal aggregation:


grouped = ds.groupby('time.year')
for idx, sub_da in grouped:
    print(sub_da.resample({'time':'3M'}).mean().coords)
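The second option can also be tried on synthetic data that already has a regular datetime index, so no time conversion is needed. This is only a sketch under assumptions: the variable name Tair is reused for made-up daily values, and the resampling frequency is "QS" (quarter start) instead of '3M' so that the alias works across pandas versions.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of made-up daily values with a standard DatetimeIndex.
time = pd.date_range("2015-01-01", "2016-12-31", freq="D")
ds = xr.Dataset(
    {"Tair": ("time", np.arange(time.size, dtype=float))},
    coords={"time": time},
)

quarters_per_year = {}
# Group by year, then resample each yearly subset to quarterly means.
for year, sub in ds.groupby("time.year"):
    quarterly = sub.resample(time="QS").mean()
    quarters_per_year[int(year)] = quarterly.sizes["time"]

print(quarters_per_year)
```

Each yearly subset is reduced to four quarterly means, which mirrors the year-then-three-month aggregation in the snippet above.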

score 0 · Answer 4 · answered Apr 03 '18 at 08:37

Not a python solution, but I think this is how you could do it using CDO in a bash script loop:

# loop over months:
for i in {1..12}; do
   # This gives the hourly mean for each month separately 
   cdo yhourmean -selmon,${i} datafile.nc mon${i}.nc
done
# merge the files
cdo mergetime mon*.nc hourlyfile.nc
rm -f mon*.nc # clean up the files

Note that if your data doesn't start in January then you will get a "jump" in the time axis of the final file... I think that can be sorted by setting the year after the yhourmean command if that is an issue for you.

thanks @Adrian, I am looking for a python soln, but your effort is appreciated — user308827, Apr 03 '18 at 17:24

score 0 · Answer 5 · answered Apr 07 '18 at 19:32

With this

import xarray as xr
ds = xr.open_mfdataset('ecmwf_usa_2015.nc')
print(ds.groupby('time.hour').mean('time'))

I get something like this:

Dimensions:    (hour: 24, latitude: 93, longitude: 281)
Coordinates:
  * longitude  (longitude) float32 230.0 230.25 230.5 230.75 231.0 231.25 ...
  * latitude   (latitude) float32 48.0 47.75 47.5 47.25 47.0 46.75 46.5 ...
  * hour       (hour) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...

I think that is what you want.

I thought the same, but this is 24. OP wants 24*12 — Morse, Apr 08 '18 at 18:10
