1

I have a dataframe (snippet below) with index in format YYYYMM and several columns of values, including one called "month" in which I've extracted the MM data from the index column.

      index             st             us       stu     px month
0    202001      2616757.0      3287969.0  0.795858  2.036    01
1    201912      3188693.0      3137911.0  1.016183  2.283    12
2    201911      3610052.0      2752828.0  1.311398  2.625    11
3    201910      3762043.0      2327289.0  1.616492  2.339    10
4    201909      3414939.0      2216155.0  1.540930  2.508    09

What I want to do is make a new column called 'stavg' which takes the 5-year average of the 'st' column for the given month. For example, since the top row refers to 202001, the stavg for that row should be the average of the January values from 2019, 2018, 2017, 2016, and 2015. Going back in time by each additional year should pull the moving average back as well, such that stavg for the row for, say, 201205 should show the average of the May values from 2011, 2010, 2009, 2008, and 2007.

      index             st             us       stu     px month   stavg   
0    202001      2616757.0      3287969.0  0.795858  2.036    01     xxx    
1    201912      3188693.0      3137911.0  1.016183  2.283    12     xxx
2    201911      3610052.0      2752828.0  1.311398  2.625    11     xxx
3    201910      3762043.0      2327289.0  1.616492  2.339    10     xxx
4    201909      3414939.0      2216155.0  1.540930  2.508    09     xxx

I know how to generate new columns of data based on operations on other columns on the same row (such as dividing 'st' by 'us' to get 'stu' and extracting digits from index to get 'month') but this notion of creating a column of data based on previous values is really stumping me.

Any clues on how to approach this would be greatly appreciated!! I know that for the first five years of data, I won't be able to populate the 'stavg' column with anything, which is fine--I could use NaN there.

  • Does this answer your question? [Moving average or running mean](https://stackoverflow.com/questions/13728392/moving-average-or-running-mean) – Joe Apr 12 '20 at 05:59
  • https://stackoverflow.com/questions/40060842/moving-average-pandas – Joe Apr 12 '20 at 06:00
  • These basic moving average posts don't really get at the heart of the issue, which is that the values that need to be average are spread out in the series, not just the most recent values. I'm trying to average together only values for each respective month for the previous five years, not just take the average of the last five months like a simple moving average would do. – sendtoprinter Apr 12 '20 at 17:25
  • Having said the above, if someone were to generalize the explanation for the regular moving average code to apply to seasonal cases like mine, I think a lot of people would find that useful! – sendtoprinter Apr 13 '20 at 19:48

2 Answers2

1

Try defining a function and using apply method

df['year'] = (df['index'].astype(int)/100).astype(int)

def get_stavg(df, year, month):
    # get year from index

    df_year_month = df.query('@year - 5 <= year < @year and month == @month')
    return df_year_month.st.mean()


df['stavg'] = df.apply(lambda x: get_stavg(df, x['year'], x['month']), axis=1)
jcaliz
  • 3,891
  • 2
  • 9
  • 13
  • Hmm, this is not working for me in part because "unsupported operand types for //: 'str' and 'int'" which I guess means that my index is a string or int (as opposed to a float??). Any clue how to fix that? – sendtoprinter Apr 12 '20 at 12:47
  • `index`. is not referring to index of the dataframe, you have a column named `index`. which I following the format Year-Month, I thought this column was a integer. Well it is hard to deal with comparison to strings, you can do `df['index'] = df['index'].astype(int)`. before all of this, and let me know if that works. – jcaliz Apr 12 '20 at 18:51
  • Thanks for the additional help, jcaliz. I now get "TypeError: unsupported operand type(s) for //: 'VariableNode' and 'int'". So it still seems like it's not happy with 'index' even after adding df['index'] = df['index'].astype(int) beforehand in the code. Does // require some other type of variable? Really appreciate the help! – sendtoprinter Apr 13 '20 at 11:35
  • `//`. is just handy way to get the integer part of a division, I have updated the solution with a more simplistic version, however can you put the output of `df.dtypes` I would love to know what datatype you have in the dataframe. – jcaliz Apr 13 '20 at 18:21
  • Thanks so much, jcaliz, I think this works! Here's what I saw from df.dtypes: {index object NG.N5020US2.M float64 NG.N9140US2.M float64 stu float64 px float64 month object} In any case, the new way seems to work, so I really appreciate your help and patience. – sendtoprinter Apr 13 '20 at 19:44
  • That's interesting, you don't have any strange dtypes, I wonder why the intial solution didn't work, anyway ahppy to help. – jcaliz Apr 13 '20 at 19:46
0

If you are looking for a pandas only solution you could do something like

Dummy Data

Here we create a dummy datasets with 10 years of data with only two months (Jan and Feb).

import pandas as pd


df1 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-JAN")})
df2 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-FEB")})
df1["n"] = df1.index*2
df2["n"] = df2.index*3
df = pd.concat([df1, df2]).sort_values("date").reset_index(drop=True)

df.head(10)
        date   n
0 2010-01-01   0
1 2010-02-01   0
2 2011-01-01   2
3 2011-02-01   3
4 2012-01-01   4
5 2012-02-01   6
6 2013-01-01   6
7 2013-02-01   9
8 2014-01-01   8
9 2014-02-01  12

Groupby + rolling mean

df["n_mean"] = df.groupby(df["date"].dt.month)["n"]\
                 .rolling(5).mean()\
                 .reset_index(0,drop=True)
         date   n  n_mean
0  2010-01-01   0     NaN
1  2010-02-01   0     NaN
2  2011-01-01   2     NaN
3  2011-02-01   3     NaN
4  2012-01-01   4     NaN
5  2012-02-01   6     NaN
6  2013-01-01   6     NaN
7  2013-02-01   9     NaN
8  2014-01-01   8     4.0
9  2014-02-01  12     6.0
10 2015-01-01  10     6.0
11 2015-02-01  15     9.0
12 2016-01-01  12     8.0
13 2016-02-01  18    12.0
14 2017-01-01  14    10.0
15 2017-02-01  21    15.0
16 2018-01-01  16    12.0
17 2018-02-01  24    18.0
18 2019-01-01  18    14.0
19 2019-02-01  27    21.0

By definition for the first 4 years the result is NaN.

Update

For your particular case

import pandas as pd

index = [f"{y}01" for y in range(2010, 2020)] +\
        [f"{y}02" for y in range(2010, 2020)]

df = pd.DataFrame({"index":index})
df["st"] = df.index + 1
# dates/ index should be sorted
df = df.sort_values("index").reset_index(drop=True)

# extract month
df["month"] = df["index"].str[-2:]


df["st_mean"] = df.groupby("month")["st"]\
                  .rolling(5).mean()\
                  .reset_index(0,drop=True)


rpanai
  • 12,515
  • 2
  • 42
  • 64
  • Thanks, rpanai. This looks promising but I can't quite get it to work because my original index column is not actually a datetime series. It's just a YYYYMM string. I tried to convert it using df.index.to_datetime() but I get an error "'RangeIndex' object has no attribute 'to_datetime'" where I got stuck. Do you know of any other ways to convert and index column into a datetime? – sendtoprinter Apr 12 '20 at 12:28