
I have a dataframe for which I am trying to calculate the year-to-date average of my value columns. Below is a sample dataframe.

    date       name values  values2
0   2019-01-01  a     1        1
1   2019-02-01  a     3        3
2   2019-03-01  a     2        2
3   2019-04-01  a     6        2

I want to create new columns (values_ytd & values2_ytd) that will average the values from January to the latest period within the same year (April in sample data). I will need to group the data by year & name when calculating the averages. I am looking for an output similar to this.

    date        name    values  values2  values2_ytd    values_ytd
0   2019-01-01   a        1        1         1              1
1   2019-02-01   a        3        3         2              2
2   2019-03-01   a        2        2         2              2
3   2019-04-01   a        6        2         2              3

I have tried unsuccessfully to use expanding().mean(), but most likely I was doing it wrong. My main dataframe has numerous name categories and many more columns. Here is the code I was attempting to use,

df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).expanding().mean().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)

but am receiving the following error.

NotImplementedError: ops for Expanding for this dtype datetime64[ns] are not implemented

Note: The code below works perfectly when substituting cumsum() for .expanding().mean() to create a year-to-date sum of the values, but I can't figure it out for averages.

df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).cumsum().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)

Any help is greatly appreciated.

2 Answers


You should set the date column as the index, df.set_index('date', inplace=True), and then use df.groupby('name').resample('AS').mean()
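For reference, resample('AS') buckets rows by year-start frequency, so this produces one aggregated row per name and year rather than a running monthly mean — a minimal sketch on the question's sample data (assuming a recent pandas; newer versions prefer the alias 'YS' over 'AS'):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01']),
    'name': ['a'] * 4,
    'values': [1, 3, 2, 6],
    'values2': [1, 3, 2, 2],
})
df.set_index('date', inplace=True)

# One aggregated row per (name, year) bucket, not a running mean per month:
# for 2019/'a', values -> (1+3+2+6)/4 = 3.0 and values2 -> (1+3+2+2)/4 = 2.0
annual = df.groupby('name').resample('AS')[['values', 'values2']].mean()
print(annual)
```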

FBruzzesi
  • Never used resample. Will this calculate YTD averages for all years separately? I have over 7 years of data and 6000 names. Each month should have a different value for each name. Would I reset the index after that to get the dataframe back to the original format? – Jack Valadez Dec 17 '19 at 08:17
  • The 'AS' parameter means year-start frequency, therefore it won't calculate YTD. You may want to check this post [link](https://stackoverflow.com/questions/56799202/pandas-groupby-and-cumulative-mean-of-previous-rows-in-group). Yes you can reset index afterward to get back the columns. – FBruzzesi Dec 17 '19 at 08:26
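The approach in the linked post builds the running mean from cumulative sum and cumulative count, which sidesteps expanding() entirely. A sketch on the question's sample data (the variable names here are illustrative, not from the original):

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01']),
    'name': ['a'] * 4,
    'values': [1, 3, 2, 6],
    'values2': [1, 3, 2, 2],
})

g = df1.groupby([df1['name'], df1['date'].dt.year])
# Running mean = running sum / running count within each (name, year) group.
counts = g.cumcount() + 1
ytd = g[['values', 'values2']].cumsum().div(counts, axis=0).add_suffix('_ytd')
df1 = df1.join(ytd)
print(df1)
```

Because cumsum() and cumcount() both preserve the original row index, the join lines the YTD columns up with the source rows without any index gymnastics.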

Try this:

df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df[['values_ytd', 'values2_ytd']] = df.groupby([df.index.year, 'name'])[['values', 'values2']].expanding().mean().reset_index(level=[0, 1], drop=True)


df
           name  values  values2  values_ytd  values2_ytd
date                                                     
2019-01-01    a       1        1         1.0          1.0
2019-02-01    a       3        3         2.0          2.0
2019-03-01    a       2        2         2.0          2.0
2019-04-01    a       6        2         3.0          2.0

Example using multiple names and years:

    date       name values  values2
0   2019-01-01  a     1        1
1   2019-02-01  a     3        3
2   2019-03-01  a     2        2
3   2019-04-01  a     6        2
4   2019-01-01  b     1        4
5   2019-02-01  b     3        4
6   2020-01-01  a     1        1
7   2020-02-01  a     3        3
8   2020-03-01  a     2        2
9   2020-04-01  a     6        2

Output:

           name  values  values2  values_ytd  values2_ytd
date                                                     
2019-01-01    a       1        1         1.0          1.0
2019-02-01    a       3        3         2.0          2.0
2019-03-01    a       2        2         2.0          2.0
2019-04-01    a       6        2         3.0          2.0
2019-01-01    b       1        4         1.0          4.0
2019-02-01    b       3        4         2.0          4.0
2020-01-01    a       1        1         1.0          1.0
2020-02-01    a       3        3         2.0          2.0
2020-03-01    a       2        2         2.0          2.0
2020-04-01    a       6        2         3.0          2.0
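The key to avoiding the asker's NotImplementedError is selecting only the value columns before calling expanding(), so the datetime64 column never reaches the aggregation. The same result can be had without setting the date as the index — a self-contained sketch (droplevel and join are my additions, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01',
                            '2019-01-01', '2019-02-01']),
    'name': ['a', 'a', 'a', 'a', 'b', 'b'],
    'values': [1, 3, 2, 6, 1, 3],
    'values2': [1, 3, 2, 2, 4, 4],
})

# Select the numeric columns before expanding() so the datetime column is
# excluded; drop the (year, name) group levels to recover the row index,
# then join aligns the YTD columns back onto the original rows.
ytd = (df.groupby([df['date'].dt.year, 'name'])[['values', 'values2']]
         .expanding().mean()
         .droplevel([0, 1])
         .add_suffix('_ytd'))
df = df.join(ytd)
print(df)
```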
luigigi