2

I'm trying to set up a 5-day (adjustable) running mean for ratings I made for various dates with a Python Pandas DataFrame.

I can easily get the average mean by day using the following code

import pandas as pd
import datetime as dt
RTC = pd.read_csv(...loads file, has 'Date' and 'Rating' columns...)
daterange = RTC['Date'].max() - RTC['Date'].min()
days_means = []
for day_offset in range(daterange.days+1):
    filldate = (RTC['Date'].min() + dt.timedelta).strftime('%Y-%m-%d')
    days_means.append(RTC[RTC['Date']==filldate]['Rating'].mean())

I'm thinking that the most natural way to extend this would be to make filldate a list (or a series?) and then have a new mask like

RTC['Date'] in filldate

But if I do this I get an error that states

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I'd guess somewhere I'd want to put an any-statement somewhere in this, but I cannot get this working.

Does anyone have advice on how to make this work properly?

Thanks!

EDIT: For reference, here's what my data would look like

Date      Rating  OtherColumns...
1-1-2014  5       ...
1-2-2014  6
1-2-2014  7
1-3-2014  8
1-3-2014  2
1-4-2014  3
1-6-2014  6
...

So the 5-day mean for 1-3-2014 would be (5+6+7+8+2+3)/6. Note that there are two entries for 1-2-2014 and 1-3-2014 nothing for 1-5-2014.

ApproachingDarknessFish
  • 14,133
  • 7
  • 40
  • 79
tcmJOE
  • 23
  • 6
  • Is this relevant? [Pandas: rolling mean by time interval](http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval) – Lack Feb 01 '15 at 19:20
  • you have include the observation for the 1-4-2014 in your calc, is this by mistake? – JAB Feb 07 '15 at 23:55
  • @Lack Actually, yes! It links to https://stackoverflow.com/questions/14300768/pandas-rolling-computation-with-window-based-on-values-instead-of-counts which I think could be adapted to using timedeltas. – tcmJOE Feb 07 '15 at 23:55
  • @J Bradley, I'm including days forward in the running mean (2 days forward and 2 days back, rather than 4 days back to present. So I'm looking at a central moving average) though it's simply a shift of days. – tcmJOE Feb 07 '15 at 23:57

1 Answers1

0

Updating answer based on the new information:

 #set up new example frame
 rng = pd.DatetimeIndex(pd.to_datetime(['1/1/2014','1/2/2014','1/2/2014', '1/3/2014', '1/3/2014','1/4/2014','1/6/2014']))
df = pd.DataFrame({'rating':[5,6,7,8,2,3,6],'date':rng})

#set date as datetime index
df.set_index('date',inplace=True)

Calculating the 5 day centered mean. Because the dataframe can have missing days or days with more than 1 observation the data needs to be resample a frequency of days, with blank days filled to 0:

pd.rolling_mean(df.resample('1D',how='sum').fillna(0),5, min_periods=1, center=True)

This returns:

2014-01-01  9.333333
2014-01-02  7.750000
2014-01-03  6.200000
2014-01-04  6.400000
2014-01-05  4.750000
2014-01-06  3.000000

Note the 5 day moving average for 2014-01-03 is 31/5 not 31/6

Adding a solution that gives the average using the number of observations in a 5 day centered window rather than a five day rolling average.

   #create timedeltas for window
  forward =pd.Timedelta('2 days')
  back = pd.Timedelta('-2 days')

 def f(x):
    five_day_obs = df.rating[(df.date >= x + back) & (df.date < x+ forward)]
    return five_day_obs.sum().astype(float)/five_day_obs.count()

This returns:

df.date.apply(f)

0    6.000000
1    5.600000
2    5.600000
3    5.166667
4    5.166667
5    5.200000
6    4.500000
Name: date, dtype: int64
JAB
  • 12,401
  • 6
  • 45
  • 50
  • This looks like what I need. Will I run into issues if I have a different number of ratings for each day? – tcmJOE Feb 01 '15 at 22:08
  • No, it will work with a different number of observations each day. – JAB Feb 01 '15 at 23:29
  • @tcmJOE let me know if you have any issues. – JAB Feb 02 '15 at 02:56
  • So I'm trying as above except with min_periods = 1 (so it doesn't complain for days with missing records) and how='mean' ('sum' does not average). However, the answers are not corresponding to the values I'm calculating by hand. Perhaps I need to do something with the freq or how parameters? – tcmJOE Feb 02 '15 at 16:01
  • the `how` parameter handles the aggregation of the resample. If you use mean then the rolling mean will be calculated from the mean of the days ratings when upsampled to the frequency of days. Updated question to show how the resampling works. – JAB Feb 02 '15 at 16:25
  • I'm afraid this still does not answer my question. I'm not looking to resample anything. Instead, I have data like `Date Rating 1-1-2014 5 1-2-2014 6 1-2-2014 7 1-3-2014 2 1-6-2014 3` So there are multiple entries for various dates (unlike your example above). Unfortunately, the rolling_mean function is not giving me anything corresponding to the sort of mean I expect (e.g. 5-day mean for 1-3-2014 above should be (5+6+7+2)/4). – tcmJOE Feb 07 '15 at 23:16
  • I included the resampling in case you had multiple observations on the same date as it wasn't clear from your question. Subsequently you say: "Will I run into issues if I have a different number of ratings for each day?" If you want to use a windowing function then the data needs to be indexed into consistent periods. – JAB Feb 07 '15 at 23:24
  • I have added a solution that counts the sums observations in a 5 day centered window and divides them by the count of the observations rather than the days. – JAB Feb 08 '15 at 01:03