I am trying to use pandas to compute daily climatology. My code is:

import random

import pandas as pd

dates      = pd.date_range('1950-01-01', '1953-12-31', freq='D')
rand_data  = [int(1000*random.random()) for i in range(len(dates))]
cum_data   = pd.Series(rand_data, index=dates)
cum_data.to_csv('test.csv', sep="\t")

cum_data is a Series indexed by daily dates from 1 January 1950 to 31 December 1953. I want to create a new vector of length 365 whose first element is the average of rand_data over 1 January 1950, 1951, 1952 and 1953, whose second element is the same average for 2 January, and so on.

Any suggestions how I can do this using pandas?

BenMorel
user308827

4 Answers

You can group by the day of the year, and then calculate the mean for these groups:

cum_data.groupby(cum_data.index.dayofyear).mean()

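However, you have to be aware of leap years: after 29 February, every calendar date in a leap year has a day of year one higher than in other years, so this approach mixes different dates in the same group. As an alternative, you can group by the month and the day: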

In [13]: cum_data.groupby([cum_data.index.month, cum_data.index.day]).mean()
Out[13]:
1  1     462.25
   2     631.00
   3     615.50
   4     496.00
...
12  28    378.25
    29    427.75
    30    528.50
    31    678.50
Length: 366, dtype: float64
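
To make the leap-year issue concrete, here is a minimal sketch. It assumes a deterministic stand-in for the question's random data (the values are simply 0, 1, 2, ...), so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
# Deterministic stand-in for the random data in the question (an assumption).
cum_data = pd.Series(np.arange(len(dates), dtype=float), index=dates)

# Day of year 60 is 1 March in 1950, 1951 and 1953, but 29 February in 1952.
doy60 = cum_data[cum_data.index.dayofyear == 60]
doy60_dates = doy60.index.strftime('%Y-%m-%d').tolist()
print(doy60_dates)

by_doy = cum_data.groupby(cum_data.index.dayofyear).mean()
by_month_day = cum_data.groupby([cum_data.index.month, cum_data.index.day]).mean()

# Both results have 366 entries, but only the (month, day) grouping
# keeps every calendar date in its own group.
print(len(by_doy), len(by_month_day))
print(by_month_day.loc[(2, 29)])  # mean over the single 29 February (1952)
```

With dayofyear, group 60 averages three 1 March values with one 29 February value, which is the mixing described above.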
joris
  • Hi Joris, are you maybe aware of any clever way to calculate a daily climatology that includes adjacent days? For example, for 3 January and a 5-day window, the idea would be to use all of 1, 2, 3, 4 and 5 January for all available years and calculate the mean. The same would apply to all other days of the year. At the moment I can do that with a loop, but if there is a cleverer and/or faster way that avoids loops, it would be useful to know. – NikMas Mar 19 '20 at 16:08
  • I think the easiest is to first do a rolling window mean with a window of 5 days, and then in a second step doing the groupby by year (or the other way around, that might actually be more efficient as you do the rolling window with less data). If you have a regular time series (all days available for all years), taking the mean in two steps shouldn't matter. But if that doesn't answer the question, best to open a separate question. – joris Mar 19 '20 at 19:09
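
A minimal sketch of the two-step idea from the comment above, assuming a centred 5-day window, grouping by (month, day) as in the answer, and a deterministic stand-in for the random data. The name clim_5day is only illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
# Deterministic stand-in for the random data in the question (an assumption).
cum_data = pd.Series(np.arange(len(dates), dtype=float), index=dates)

# Step 1: centred 5-day rolling mean, so the value at 3 January
# is the mean of 1-5 January of that year.
smoothed = cum_data.rolling(window=5, center=True).mean()

# Step 2: average the smoothed values for each calendar date across years.
# The two incomplete windows at each end of the series are NaN,
# and mean() skips them.
clim_5day = smoothed.groupby([smoothed.index.month, smoothed.index.day]).mean()
print(clim_5day.loc[(1, 3)])
```

This is a mean of window means, so it matches the pooled mean over all days in the window only when every window is complete, which is the "regular time series" condition mentioned in the comment.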

In the hope that it helps, I want to post my solution for getting a climatology series with the same index and length as the original time series.

I use joris' solution to get a "model climatology" of 365/366 elements, then I build my desired series taking values from this model climatology and time index from my original time series. This way, things like leap years are automatically taken care of.

# I start with my time series named 'serData'.
# I apply joris' solution to it, getting a 'model climatology' of length 365 or 366.
serClimModel = serData.groupby([serData.index.month, serData.index.day]).mean()

# Now I build the climatology series, taking values from serClimModel depending on the index of serData.
# (In Python 3, zip() returns an iterator, so the (month, day) keys are built as a MultiIndex instead.)
keys = pd.MultiIndex.from_arrays([serData.index.month, serData.index.day])
serClimatology = serClimModel.reindex(keys)

# Now serClimatology has an index like this: [1,1] ... [12,31].
# So, as a final step, I take as time index the one of serData.
serClimatology.index = serData.index
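
As a possible shortcut, the same result can be sketched with groupby(...).transform('mean'), which returns the group mean aligned to the original index in one step. This is an alternative to the steps above, not the method of this answer, and the data below is a deterministic stand-in for the real series.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
# Deterministic stand-in for the user's time series (an assumption).
serData = pd.Series(np.arange(len(dates), dtype=float), index=dates)

# transform('mean') broadcasts each (month, day) mean back onto every
# date in that group, so the result keeps serData's index and length.
keys = [serData.index.month, serData.index.day]
serClimatology = serData.groupby(keys).transform('mean')

# Anomalies relative to the daily climatology follow directly.
serAnomaly = serData - serClimatology
print(serClimatology.loc[pd.Timestamp('1951-01-01')])
```

Because the grouping uses (month, day), 29 February simply gets the mean over the leap years that are present.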
DarioZapp

@joris Thanks. Your answer was just what I needed to calculate daily climatologies with pandas, but it stops short of the final step: re-mapping the (month, day) index back to a day-of-year index that covers all years, including leap years, i.e. 1 through 366. So I thought I'd share my solution for other users. 1950 through 1953 is 4 years with one leap year, 1952. Note that since random values are used, each run will give different results.

...   
from datetime import date
doy = []
doy_mean = []
doy_size = []
for name, group in cum_data.groupby([cum_data.index.month, cum_data.index.day]):
  (mo, dy) = name
  # Note: can use any leap year here.
  yrday = (date(1952, mo, dy)).timetuple().tm_yday
  doy.append(yrday)
  doy_mean.append(group.mean())
  doy_size.append(group.count())
  # Note: useful climatology stats are also available via group.describe(), which returns a Series
  #desc = group.describe()
  # desc["mean"], desc["min"], desc["max"], desc["std"], quartiles, etc.

# we lose the counts here.
new_cum_data  = pd.Series(doy_mean, index=doy)
print(new_cum_data.loc[366])
>> 634.5

pd_dict = {}
pd_dict["mean"] = doy_mean
pd_dict["size"] = doy_size
cum_data_df = pd.DataFrame(data=pd_dict, index=doy)

print(cum_data_df.loc[366])
>> mean    634.5
>> size      4.0
>> Name: 366, dtype: float64
# and just to check Feb 29
print(cum_data_df.loc[60])
>> mean    343
>> size      1
>> Name: 60, dtype: float64
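
For comparison, the loop above can be condensed with agg. This is only a sketch under the same leap-year mapping (1952 as the reference year), using a deterministic stand-in for the random data. The name stats is illustrative.

```python
from datetime import date

import numpy as np
import pandas as pd

dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
# Deterministic stand-in for the random data in the question (an assumption).
cum_data = pd.Series(np.arange(len(dates), dtype=float), index=dates)

# Mean and count per (month, day) in one call; columns are 'mean' and 'count'.
stats = cum_data.groupby([cum_data.index.month, cum_data.index.day]).agg(['mean', 'count'])

# Re-map (month, day) to a day of year 1..366 using a leap year as reference.
stats.index = [date(1952, mo, dy).timetuple().tm_yday for mo, dy in stats.index]

print(stats.loc[366])  # 31 December, present in all four years
print(stats.loc[60])   # 29 February, present only in 1952
```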
Eric Bridger
  • Hello @user308827, I'm using your new code to calculate a daily climatology. But what if I want harmonic means? I'm using doy_harmonic_mean.append(group.statistics.harmonic_mean()) but I get an error: AttributeError: 'Series' object has no attribute 'statistics'. How can I apply harmonic means in this code? – Javier Feb 17 '21 at 16:48

Grouping by month and day is a good solution. However, the simpler idea of groupby(dayofyear) still works if you use xarray.CFTimeIndex instead of pandas.DatetimeIndex, i.e.:

  1. Delete Feb 29 by using
cum_data = cum_data[~((cum_data.index.month==2) & (cum_data.index.day==29))]
  2. Replace the index of the above data with an xarray.CFTimeIndex, i.e.,
index = xarray.cftime_range('1950-01-01', '1953-12-31', freq='D', calendar = 'noleap')
index = index[~((index.month==2)&(index.day==29))]
cum_data.index = index

Now, for both non-leap and leap years, day of year 60 is March 1st, and there are 365 days of year in total, so groupby(dayofyear) correctly calculates the climatological daily mean.
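
If xarray is not available, a similar "no-leap" day of year can be sketched in plain pandas. This is an alternative to the CFTimeIndex approach above, not the same method: it drops 29 February and shifts the day of year back by one for dates after February in leap years. The names no_feb29 and noleap_doy are illustrative, and the data is a deterministic stand-in.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
# Deterministic stand-in for the random data in the question (an assumption).
cum_data = pd.Series(np.arange(len(dates), dtype=float), index=dates)

# Drop 29 February so that every year has 365 days.
idx = cum_data.index
no_feb29 = cum_data[~((idx.month == 2) & (idx.day == 29))]

# In leap years, dates after February have a day of year that is one too
# high for a 365-day calendar, so subtract one there.
idx = no_feb29.index
shift = np.asarray(idx.is_leap_year & (idx.month > 2)).astype(int)
noleap_doy = np.asarray(idx.dayofyear) - shift

clim = no_feb29.groupby(noleap_doy).mean()
print(len(clim), clim.loc[60])  # day 60 is 1 March in every year
```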

QuanLiu