
I have a pandas DataFrame with irregularly spaced dates. Is there a way to use 7 days as a moving window to calculate the median absolute deviation, median, etc.? I feel like I could somehow use pandas.rolling_apply, but it does not take irregularly spaced dates for the window parameter. I found a similar post (https://stackoverflow.com/a/30244019/3128336) and am trying to create my own custom function, but I still cannot figure it out. Can anyone please help?

import pandas as pd
from datetime import datetime

person = ['A','B','C','B','A','C','A','B','C','A',]
ts = [
    datetime(2000, 1, 1),
    datetime(2000, 1, 1),
    datetime(2000, 1, 10),
    datetime(2000, 1, 20),
    datetime(2000, 1, 25),
    datetime(2000, 1, 30),
    datetime(2000, 2, 8),
    datetime(2000, 2, 12),
    datetime(2000, 2, 17),
    datetime(2000, 2, 20),
]
score = [9,2,1,3,8,4,2,3,1,9]
df = pd.DataFrame({'ts': ts, 'person': person, 'score': score})

df looks like this:

    person  score   ts
0   A       9       2000-01-01
1   B       2       2000-01-01
2   C       1       2000-01-10
3   B       3       2000-01-20
4   A       8       2000-01-25
5   C       4       2000-01-30
6   A       2       2000-02-08
7   B       3       2000-02-12
8   C       1       2000-02-17
9   A       9       2000-02-20
E.K.
  • Do you want a moving window or an expanding window? – Stop harming Monica Feb 06 '16 at 17:11
  • I want a moving window, so something like `pd.rolling_apply(df, window=relativedelta(days=7), func, min_periods=1)`. I was confused between the two. Let me correct my post. Thanks for pointing that out. – E.K. Feb 06 '16 at 17:29
  • Can you explain what does not work with the solution to the question you linked to? I suppose that resampling your data to daily data before doing the rolling apply does remove the duplicate dates? – joris Feb 06 '16 at 18:10
  • I do not think I should use `pandas.resample` to fill with the previous value, as this df contains test scores for different students. It is not like the stock price of a single company; if it were, I would agree that I could fill NAs for weekends with previous values. – E.K. Feb 06 '16 at 18:30
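
For reference, newer pandas releases accept a time-offset string as the rolling window, which handles irregularly spaced dates directly. A minimal sketch, assuming a pandas version that supports offset windows on a datetime column (this is not something `pd.rolling_apply` itself offers):

import numpy as np

# sort by timestamp; '7D' gives a backward-looking 7-day window ending at each row
weekly = df.sort_values('ts').rolling('7D', on='ts')['score']
weekly_median = weekly.median()
# unscaled median absolute deviation over the same windows
weekly_mad = weekly.apply(lambda x: np.median(np.abs(x - np.median(x))), raw=True)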

3 Answers


You can use a time delta to select rows within your window and then use apply to run through each row and aggregate:

>>> import numpy as np
>>> from datetime import timedelta
>>> delta = timedelta(days=7)
>>> # keep every score whose timestamp is no later than 7 days after the current row
>>> df_score_mean = df.apply(lambda x: np.mean(df['score'][df['ts'] <= x['ts'] + delta]), axis=1)
0    5.500000
1    5.500000
2    4.000000
3    4.600000
4    4.500000
5    4.500000
6    4.555556
7    4.200000
8    4.200000
9    4.200000
Brian Huey
  • This is close to what I need! Just one question: how can I change the way the shift works with apply? The lambda function you suggested works in the opposite direction. `pd.rolling_median(df.score, window=2)`, for example, returns NA for the first row, not the last row. (I actually want to add behaviour equivalent to `min_periods=1` to keep the values for the first week, though.) – E.K. Feb 06 '16 at 20:37
  • I believe the lambda function shouldn't return NA for any rows, because it will always select at least one row to run np.mean() on. Are you asking how to make the window forward-looking or backward-looking? In the lambda function, we select any rows less than or equal to the current row + 7 days. If you wanted to look backward 7 days, you could select rows greater than or equal to the current row - 7 days. – Brian Huey Feb 07 '16 at 17:27
  • Ah that makes sense! Yes I need to look backward. – E.K. Feb 09 '16 at 04:42
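
A minimal sketch of the backward-looking variant described in the comments above (it assumes `numpy` is imported as `np`; the code block itself is not from the original answer):

>>> import numpy as np
>>> from datetime import timedelta
>>> delta = timedelta(days=7)
>>> # average only the scores from the 7 days up to and including each row's timestamp
>>> backward_mean = df.apply(
...     lambda x: np.mean(df['score'][(df['ts'] > x['ts'] - delta) & (df['ts'] <= x['ts'])]),
...     axis=1)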

I am not familiar enough with the rolling date functions, so I wondered about first adding the missing data (in effect, a DataFrame full of empty rows). Then your rolling window should be easier to implement.

from datetime import date, datetime
import pandas as pd
##############Your Initial DataFrame ##############
person = ['A','B','C','B','A','C','A','B','C','A',]
ts = [
    datetime(2000, 1, 1),
    datetime(2000, 1, 1),
    datetime(2000, 1, 10),
    datetime(2000, 1, 20),
    datetime(2000, 1, 25),
    datetime(2000, 1, 30),
    datetime(2000, 2, 8),
    datetime(2000, 2, 12),
    datetime(2000, 2, 17),
    datetime(2000, 2, 20),
]
score = [9,2,1,3,8,4,2,3,1,9]
df = pd.DataFrame({'ts': ts, 'person': person, 'score': score})
################## Blank DataFrame in Same Format ###############
#Create some dates
start = date(2000,1,1)
end = date(2000,3,1)
#We have 3 people
Eperson=['A','B','C']
#They Score 0
Escore=[0]
#Need a date range in Days
ets=pd.date_range(start, end, freq='D')
dfEmpty=pd.DataFrame([(c,b,0) for b in Eperson for c in ets])
dfEmpty.columns=['ts','person','score']

################# Now Join them 

dfJoin=dfEmpty.merge(df,how='outer',on=['ts','person'])
dfJoin['score']=dfJoin.score_x+dfJoin.score_y
dfJoin.score.fillna(0,inplace=True)
del dfJoin['score_x']
del dfJoin['score_y']

You now have a data frame with no missing dates per person, and where a date was missing in the original, that person's score is 0.
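
With one row per person per day, a 7-day window is simply a 7-row window per person. A minimal sketch of running a rolling median on the filled frame (it assumes a pandas version where `.rolling` is a DataFrame method; the groupby/rolling pattern is not part of the original answer):

# sort so each person's rows are in date order, then take a 7-row (= 7-day) window
dfJoin = dfJoin.sort_values(['person', 'ts'])
weekly_median = (dfJoin.set_index('ts')
                       .groupby('person')['score']
                       .rolling(7, min_periods=1)
                       .median())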

I appreciate this may not work should you be dealing with millions of records.

Apologies for the non-PEP-style comments... it is still a work in progress.

Tim Seed

Just posting my solution based on Brian Huey's suggestion.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from datetime import datetime, timedelta

delta = timedelta(days=7)

def calc_mad_mean(row):
    # MAD (statsmodels' scaled version) and median of all scores whose
    # timestamp falls within the 7 days starting at this row's timestamp
    start = row['ts']
    end = start + delta
    subset = df['score'][(start <= df['ts']) & (df['ts'] < end)]
    return pd.Series({'mad': sm.robust.mad(subset), 'med': np.median(subset)})

first_wk = df.ts.iloc[0] + delta
results = df[first_wk < df.ts].apply(calc_mad_mean, axis=1)
df.join(results, how='outer')

Results

    person  score   ts           mad        med
0   A       9       2000-01-01   NaN        NaN
1   B       2       2000-01-01   NaN        NaN
2   C       1       2000-01-10   0.000000   1.0
3   B       3       2000-01-20   3.706506   5.5
4   A       8       2000-01-25   2.965204   6.0
5   C       4       2000-01-30   0.000000   4.0
6   A       2       2000-02-08   0.741301   2.5
7   B       3       2000-02-12   1.482602   2.0
8   C       1       2000-02-17   5.930409   5.0
9   A       9       2000-02-20   0.000000   9.0
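
Note that sm.robust.mad rescales the raw median absolute deviation by the normal consistency constant (it divides by 0.6745), which is why, for example, row 7's window of scores [3, 1] shows 1.4826 rather than 1.0. If you prefer to avoid the statsmodels dependency, a numpy-only equivalent is sketched below (the 0.6745 default is chosen to match statsmodels' scaling):

def mad(x, c=0.6745):
    # median absolute deviation divided by the normal consistency constant,
    # matching sm.robust.mad's default output
    x = np.asarray(x)
    return np.median(np.abs(x - np.median(x))) / c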
E.K.