9

I'm having difficulty to solve a look-back or roll-over problem in dataframe or perhaps in groupby.

The following is a simple example of the dataframe I have:

              fruit    amount    
   20140101   apple     3
   20140102   apple     5
   20140102   orange    10
   20140104   banana    2
   20140104   apple     10
   20140104   orange    4
   20140105   orange    6
   20140105   grape     1
   …
   20141231   apple     3
   20141231   grape     2

I need to calculate the average value of 'amount' of each fruit in the previous 3 days for everyday, and create the following data frame:

              fruit     average_in_last 3 days
   20140104   apple      4
   20140104   orange     10
   ...

For example on 20140104, the previous 3 days are 20140101, 20140102 and 20140103 (note the date in the data frame is not continuous and 20140103 does not exist), the average amount of apple is (3+5)/2 = 4 and orange is 10/1=10, the rest is 0.

The sample data frame is very simple but the actual data frame is much more complicated and larger. Hope someone can shed some light on this, thank you in advance!

AZhao
  • 13,617
  • 7
  • 31
  • 54
user6396
  • 1,832
  • 6
  • 23
  • 38
  • 1
    This is two problems: looking at the fruits separately, and doing a rolling mean over irregular data. The first is handled with the `groupby' function, as in http://stackoverflow.com/questions/22615288/group-data-by-seasons-using-python-and-pandas?rq=1, and the second doesn't automatically work in pandas yet: see some approaches in http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval?rq=1. – cphlewis Feb 21 '15 at 06:21
  • 1
    seems like for the examples, apples should be (3+5+0)/3 = 2.67. Doing it that way makes things much easier. – Paul H Feb 21 '15 at 06:26
  • Irregular sampling is a problem I have, and I do need to compute by (3+5)/2=4. – user6396 Feb 21 '15 at 12:57

4 Answers4

8

Assuming we have a data frame like that in the beginning,

>>> df
             fruit  amount
2017-06-01   apple       1
2017-06-03   apple      16
2017-06-04   apple      12
2017-06-05   apple       8
2017-06-06   apple      14
2017-06-08   apple       1
2017-06-09   apple       4
2017-06-02  orange      13
2017-06-03  orange       9
2017-06-04  orange       9
2017-06-05  orange       2
2017-06-06  orange      11
2017-06-07  orange       6
2017-06-08  orange       3
2017-06-09  orange       3
2017-06-10  orange      13
2017-06-02   grape      14
2017-06-03   grape      16
2017-06-07   grape       4
2017-06-09   grape      15
2017-06-10   grape       5

>>> dates = [i.date() for i in pd.date_range('2017-06-01', '2017-06-10')]

>>> temp = (df.groupby('fruit')['amount']
    .apply(lambda x: x.reindex(dates)  # fill in the missing dates for each group)
                      .fillna(0)   # fill each missing group with 0
                      .rolling(3)
                      .sum()) # do a rolling sum
    .reset_index()
    .rename(columns={'amount': 'sum_of_3_days', 
                     'level_1': 'date'}))  # rename date index to date col


>>> temp.head()
   fruit        date  amount
0  apple  2017-06-01     NaN
1  apple  2017-06-02     NaN
2  apple  2017-06-03    17.0
3  apple  2017-06-04    28.0
4  apple  2017-06-05    36.0

# converts the date index into date column 
>>> df = df.reset_index().rename(columns={'index': 'date'})  
>>> df.merge(temp, on=['fruit', 'date'])
>>> df
          date   fruit  amount  sum_of_3_days
0   2017-06-01   apple       1                NaN
1   2017-06-03   apple      16               17.0
2   2017-06-04   apple      12               28.0
3   2017-06-05   apple       8               36.0
4   2017-06-06   apple      14               34.0
5   2017-06-08   apple       1               15.0
6   2017-06-09   apple       4                5.0
7   2017-06-02  orange      13                NaN
8   2017-06-03  orange       9               22.0
9   2017-06-04  orange       9               31.0
10  2017-06-05  orange       2               20.0
11  2017-06-06  orange      11               22.0
12  2017-06-07  orange       6               19.0
13  2017-06-08  orange       3               20.0
14  2017-06-09  orange       3               12.0
15  2017-06-10  orange      13               19.0
16  2017-06-02   grape      14                NaN
17  2017-06-03   grape      16               30.0
18  2017-06-07   grape       4                4.0
19  2017-06-09   grape      15               19.0
20  2017-06-10   grape       5               20.0
dbokers
  • 840
  • 1
  • 10
  • 12
  • Nice answer, +1. I think you're more likely to get upvotes if you could make it more concise. If nothing else, just make the sample data more compact. – JohnE Aug 17 '17 at 21:06
7

I also wanted to use rolling with groupby, this is why I landed on this page, but I believe that I have a workaround that is better than the previous suggestions.

You could do the following:

pivoted_df = pd.pivot_table(df, index='date', columns='fruits', values='amount')
average_fruits = pivoted_df.rolling(window=3).mean().stack().reset_index()

the .stack() is not necessary, but will transform your pivot table back to a regular df

3

you can do it like this:

>>> df
>>>
           fruit  amount
20140101   apple       3
20140102   apple       5
20140102  orange      10
20140104  banana       2
20140104   apple      10
20140104  orange       4
20140105  orange       6
20140105   grape       1

>>> g= df.set_index('fruit', append=True).groupby(level=1)
>>> res = g['amount'].apply(pd.rolling_mean, 3, 1).reset_index('fruit')
>>> res

           fruit          0
20140101   apple   3.000000
20140102   apple   4.000000
20140102  orange  10.000000
20140104  banana   2.000000
20140104   apple   6.000000
20140104  orange   7.000000
20140105  orange   6.666667
20140105   grape   1.000000

update

Well, as @cphlewis mentioned in comments, my code will not give the results you want. I've checked different approaches and the one I found so far is something like this (not sure about performance, though):

>>> df.index = [pd.to_datetime(str(x), format='%Y%m%d') for x in df.index]
>>> df.reset_index(inplace=True)
>>> def avg_3_days(x):
        return df[(df['index'] >= x['index'] - pd.DateOffset(3)) & (df['index'] < x['index']) & (df['fruit'] == x['fruit'])].amount.mean()

>>> df['res'] = df.apply(avg_3_days, axis=1)
>>> df

       index   fruit  amount  res
0 2014-01-01   apple       3  NaN
1 2014-01-02   apple       5    3
2 2014-01-02  orange      10  NaN
3 2014-01-04  banana       2  NaN
4 2014-01-04   apple      10    4
5 2014-01-04  orange       4   10
6 2014-01-05  orange       6    7
7 2014-01-05   grape       1  NaN
Roman Pekar
  • 107,110
  • 28
  • 195
  • 197
  • You get a different result for `20140104 apple ` than the question specified. Because of the irregular sampling, I think. (Although I agree with paul above that the result should be 2.66 to be consistent.) – cphlewis Feb 21 '15 at 09:21
  • ah I see, thanks for noticing, I'll check what I can do – Roman Pekar Feb 21 '15 at 09:39
  • Hi Roman, thank you for you help. I tried something similar, it is not fast on a big data frame, and there are some item missing, for example, on 2014-01-05, there is no apple, but there should be a average amount of apple from 2014-01-02 to 2014-01-04. – user6396 Feb 21 '15 at 14:14
0
df1.index=pd.to_datetime(df1.index,format='%Y%m%d')
def function1(ss:pd.Series):
    return ss.loc[ss.index<ss.index.max()].mean()

df1.reset_index().assign(av=df1.reset_index().groupby('fruit')
            .apply(lambda dd:dd.rolling('4d',on='index').amount.apply(function1))
            .droplevel(0)).set_index('index')


          fruit  amount    av
index                           
2014-01-01   apple       3   NaN
2014-01-02   apple       5   3.0
2014-01-02  orange      10   NaN
2014-01-04  banana       2   NaN
2014-01-04   apple      10   4.0
2014-01-04  orange       4  10.0
2014-01-05  orange       6   7.0
2014-01-05   grape       1   NaN
2014-12-31   apple       3   NaN
2014-12-31   grape       2   NaN
G.G
  • 639
  • 1
  • 5