
I have a pandas DataFrame as follows:

import pandas as pd

df = pd.DataFrame([['John', '1/1/2017','10'],
                   ['John', '2/2/2017','15'],
                   ['John', '2/2/2017','20'],
                   ['John', '3/3/2017','30'],
                   ['Sue', '1/1/2017','10'],
                   ['Sue', '2/2/2017','15'],
                   ['Sue', '3/2/2017','20'],
                   ['Sue', '3/3/2017','7'],
                   ['Sue', '4/4/2017','20']
                  ],
                   columns=['Customer', 'Deposit_Date','DPD'])

What is the best way to calculate the PreviousMean column shown in the expected output below?

The column is the year-to-date average of DPD for that customer, i.e. it includes all DPD values up to, but not including, rows that share the current row's Deposit_Date. If no previous records exist, it is null (or 0).

(Screenshot of the expected output omitted; the PreviousMean values match the output table in the first answer below.)

Notes:

  • the data is grouped by Customer and expanding over Deposit_Date
  • within each group, the expanding mean is calculated using only values from previous rows
  • at the start of each new customer, the mean is 0 (or alternatively null), as there are no previous records on which to form a mean
  • the data frame is ordered by Customer and Deposit_Date
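
For example, John's 3/3/2017 row should get (10 + 15 + 20) / 3 = 15: both 2/2/2017 rows count as previous, but the 3/3/2017 row itself does not.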
Reddspark
  • Reposted from link-only answer: [How to get rid of loops and use window functions, in Pandas or Spark SQL](https://medium.com/jbennetcodes/how-to-get-rid-of-loops-and-use-window-functions-in-pandas-or-spark-sql-907f274850e4) – Trenton McKinney Mar 03 '21 at 21:35

3 Answers


Instead of grouping and expanding the mean, filter the dataframe on these conditions and take the mean of DPD:

  • Customer == current row's Customer
  • Deposit_Date < current row's Deposit_Date
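
One caveat: in the sample frame both Deposit_Date and DPD are stored as strings, so the date comparison and the mean only behave numerically after converting them, e.g.:

# convert the question's string columns to proper dtypes
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])
df['DPD'] = pd.to_numeric(df['DPD'])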

Use df.apply to perform this operation for every row in the dataframe:

df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) &
                 (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
    axis=1)

outputs:

  Customer Deposit_Date  DPD  PreviousMean
0     John   2017-01-01   10           NaN
1     John   2017-02-02   15          10.0
2     John   2017-02-02   20          10.0
3     John   2017-03-03   30          15.0
4      Sue   2017-01-01   10           NaN
5      Sue   2017-02-02   15          10.0
6      Sue   2017-03-02   20          12.5
7      Sue   2017-03-03    7          15.0
8      Sue   2017-04-04   20          13.0
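
If 0 is preferred over null at the start of each customer, the NaNs can be filled afterwards with `df['PreviousMean'] = df['PreviousMean'].fillna(0)`. Note that the apply rescans the whole frame once per row, so it is quadratic in the number of rows; the aggregate-and-merge approach in the asker's own answer below avoids this.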
Haleemur Ali
  • The third assumption is incorrect, but it's my fault for choosing a bad example. The zeroes are simply the start of a new customer and reflect the fact that there were no previous records on which to form a mean. Alternatively they can be null. – Reddspark May 30 '18 at 08:06
  • @user1761806, please update your sample data & expected output as it is still not clear what you are trying to achieve. is the `Deposit_Date` at the start of a new customer's records? If not, then I think `df['YTD_DPD_Mean'] = df.groupby(['Customer Name']).apply(lambda x: x.shift().expanding().mean())` is all you need – Haleemur Ali May 30 '18 at 09:12
  • Have updated as requested. I'm still working on it and trying something else at the moment. – Reddspark May 30 '18 at 10:05
  • @user1761806, please replace the screenshots with text. See this link for how to ask a [good pandas question](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Haleemur Ali May 30 '18 at 11:00
  • Ok I have followed the advice in the link and provided Pandas code to generate the toy dataset. – Reddspark May 30 '18 at 11:11

Here's one way to exclude repeated days from the mean calculation (written against an earlier revision of the sample data, which used a Customer Name column):

import numpy as np

# create helper series which is NaN for repeated days, DPD otherwise
s = df.groupby(['Customer Name', 'Deposit_Date']).cumcount() > 0
df['DPD2'] = np.where(s, np.nan, df['DPD'])

# expanding mean per customer (NaN helper values are skipped)
df['CumMean'] = df.groupby('Customer Name')['DPD2'].transform(lambda x: x.expanding().mean())

# drop helper column
df = df.drop(columns='DPD2')

print(df)

  Customer Name Deposit_Date  DPD  CumMean
0          John   01/01/2017   10     10.0
1          John   01/01/2017   10     10.0
2          John   02/02/2017   20     15.0
3          John   03/03/2017   30     20.0
4           Sue   01/01/2017   10     10.0
5           Sue   01/01/2017   10     10.0
6           Sue   02/02/2017   20     15.0
7           Sue   03/03/2017   30     20.0
jpp
  • Thanks - this doesn't achieve the effect of only calculating based on dates prior to the current row though as the first 2 rows should show 0 or null. However will play around with this to see if it gets me anywhere. – Reddspark May 29 '18 at 22:47
  • @user1761806, Sure, I may not fully understand your problem. You *may* have to include DPD = 0 rows for each group to get it to work as you want. – jpp May 29 '18 at 22:49
  • Yeah I do. In my example the 10 = mean(10,10) and the 13.3 = mean(10,10,20) – Reddspark May 29 '18 at 22:55

OK, here is the best solution I've come up with so far.

The trick is to first create an aggregated table at the customer and deposit-date level containing a shifted mean. To calculate this mean, you first need the running sum and count.

# aggregate to one row per customer & deposit date
s = df.groupby(['Customer', 'Deposit_Date'])['DPD'].agg(['count', 'sum']).reset_index()

# running totals per customer, and the cumulative (year-to-date) mean
s['DPD_CumSum'] = s.groupby('Customer')['sum'].cumsum()
s['DPD_CumCount'] = s.groupby('Customer')['count'].cumsum()
s['DPD_CumMean'] = s['DPD_CumSum'] / s['DPD_CumCount']

# shift so each deposit date sees only the mean of strictly earlier dates
s['DPD_PrevMean'] = s.groupby('Customer')['DPD_CumMean'].shift(1)

# merge the shifted mean back onto the original rows
df = df.merge(s[['Customer', 'Deposit_Date', 'DPD_PrevMean']],
              how='left', on=['Customer', 'Deposit_Date'])
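
Against the question's sample frame (with the string columns converted as shown under the first answer), this reproduces the same PreviousMean values as the apply approach, in a single pass:

print(df[['Customer', 'Deposit_Date', 'DPD', 'DPD_PrevMean']])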
Reddspark