
I'm looking to do a rolling mean on a dataframe, but the window has to grow to cover all rows up to the current timestamp.

For example, at time[1] (one row), compute the rolling mean over all rows so far (1); then at time[2], over all rows so far (2), and so on, growing the window as the timestamp progresses.

Also, this is done by group: after finishing group A, the window has to sort of reset when it moves on to group B.

It's kinda like this topic: Computing rolling mean in data.table with adaptive window lengths

But in Python, and taking the timestamp into account.

Also, the dataset is big, so it has to be something optimal; iterations and loops would take years.

  • Could you provide a minimal part of the dataset in order to represent what you need fully ? – ozturkib Oct 23 '20 at 08:19
  • `dataframe=pd.DataFrame({'timestamp':[0,1,2,3,0,1,2,3,4], 'user':['A','A','A','A','B','B','B','B','B'], 'answered_correctly':[0,1,1,0,1,1,1,0,0]})` – Iuryck Santos Oct 26 '20 at 19:46
  • This would be an example of the dataset. So I would need to get the Mean Accuracy of a user, without getting values from future timestamp, and considering all past timestamps. So for user A it would be [0, 0.5 , 0.66 , 0.5] – Iuryck Santos Oct 26 '20 at 19:50
  • have you tried below answer ? – ozturkib Oct 28 '20 at 10:37

2 Answers


A simple moving average has a sliding window of constant size: the rolling mean is calculated by averaging the data of the time series within x (generally constant) periods. What you need instead is the cumulative moving average of answered_correctly for each group, which takes all preceding values into account, as you described in your question.

import pandas as pd
df = pd.DataFrame({
    'timestamp'         :[3,1,2,3,4,0,1,2,0], 
    'user'              :['B','A','A','A','B','B','B','B','A'], 
    'answered_correctly':[0,1,1,0,1,1,1,0,0]
    })
    
# sort the dataframe by the timestamp
df.sort_values('timestamp', inplace=True)

# expanding() grows the window from the first row of each group,
# so mean() yields the cumulative moving average per user
df.groupby('user')['answered_correctly'].expanding().mean()

The output:

user   
A     8    0.000000
      1    0.500000
      2    0.666667
      3    0.500000
B     5    1.000000
      6    1.000000
      7    0.666667
      0    0.500000
      4    0.600000

If you would like to access group A:

df.groupby('user')['answered_correctly'].expanding().mean().A

The output:

8    0.000000
1    0.500000
2    0.666667
3    0.500000
Name: answered_correctly, dtype: float64

If you would like to access the values of group A as a NumPy array (note that `.get_values()` was removed in pandas 1.0; use `.to_numpy()` instead):

df.groupby('user')['answered_correctly'].expanding().mean().A.to_numpy()

The output:

array([0.        , 0.5       , 0.66666667, 0.5       ])
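Since the question stresses that the dataset is large, here is a sketch (not part of the original answer) of the same cumulative mean computed with vectorized `cumsum`/`cumcount` instead of `expanding()`, which avoids the per-group expanding machinery; the frame and column names assume the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp'         : [3, 1, 2, 3, 4, 0, 1, 2, 0],
    'user'              : ['B', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A'],
    'answered_correctly': [0, 1, 1, 0, 1, 1, 1, 0, 0]})

# order rows by time so "preceding" means earlier timestamps
df = df.sort_values('timestamp')
g = df.groupby('user')['answered_correctly']

# cumulative sum of correct answers divided by the number of answers so far
df['cum_mean'] = g.cumsum() / (g.cumcount() + 1)
```

For user A this gives 0, 0.5, 0.667, 0.5, matching the expanding-mean output above.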
ozturkib
  • Real thx man. I actually found a different solution to it. I'll test both and compare. Since I'm working with a really large dataset, every optimization counts. – Iuryck Santos Oct 28 '20 at 17:07

In case anybody has the same problem, an alternative method to Ozturkib's answer is to do as follows:

by  = ['user_id']             # grouping column
on  = ['answered_correctly']  # column to aggregate
how = ['mean']                # 'mean' for the accuracy asked about in the question

agg = train.groupby(by)[on].apply(lambda x: (
    x
    .shift(1)                               # exclude the current row from its own mean
    .rolling(window=len(x), min_periods=1)  # window spans the whole group
    .agg(how)))

This will leave NaNs in the first row of each group (the shifted value has no predecessor). In my case I prefer these NaNs, but if you don't want them you can fill them afterwards with `fillna`; note that `min_periods=0` won't remove them, since the NaN comes from the `shift`.
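On the question's sample data for user A, the same idea can be sketched with `transform` so the result aligns with the original index; the column names here are assumptions matching the example in the comments, and `'mean'` is used since the goal is mean accuracy:

```python
import pandas as pd

train = pd.DataFrame({
    'user_id'           : ['A', 'A', 'A', 'A'],
    'answered_correctly': [0, 1, 1, 0]})

# shift(1) makes each row see only strictly earlier answers,
# so the first row of each group comes out as NaN
train['past_mean'] = train.groupby('user_id')['answered_correctly'].transform(
    lambda x: x.shift(1).rolling(window=len(x), min_periods=1).mean())
```

This gives NaN, 0, 0.5, 0.667 for user A; the NaN marks the first answer, which has no history yet.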