
First-time poster here, so apologies if I haven't quite got this question right. I've spent many years manipulating data in Excel and PowerPivot, but my current project requires something with more heavy-lifting capability. I've been looking at pandas and think it can more than handle the job, but I'm stuck.

I am trying to calculate the number of days between purchases for each customer.

My initial dataframe looks like this:

    customer_id date        invoice_amt 
0   101A        21/03/2012  654.76      
1   101A        1/02/2012   234.45      
2   102A        23/01/2012  99.45       
3   104B        18/12/2011  767.63      
4   101A        9/12/2011   124.76      
5   104B        27/11/2011  346.87      
6   102A        18/11/2011  652.65      
7   104B        12/10/2011  765.21      
8   101A        1/10/2011   275.76      
9   102A        21/09/2011  532.21  

My target dataframe looks like:

    customer_id date        invoice_amt days_since
0   101A        21/03/2012  654.76      49
1   101A        1/02/2012   234.45      54
2   102A        23/01/2012  99.45       66
3   104B        18/12/2011  767.63      21
4   101A        9/12/2011   124.76      69
5   104B        27/11/2011  346.87      46
6   102A        18/11/2011  652.65      58
7   104B        12/10/2011  765.21      NaN
8   101A        1/10/2011   275.76      NaN
9   102A        21/09/2011  532.21      NaN

I have gotten to the point of being able to calculate the days_since values within each grouped dataframe, but I'm not sure how to get those values back into the main dataframe (data_df).

Any help would be very much appreciated ... thank you

import pandas as pd
#import numpy as np

#dataframe data note: no_days_since_last_purchase hard coded for testing purposes
my_data = {'customer_id' : ['101A', '101A', '102A', '104B', '101A', '104B', '102A', '104B', '101A', '102A' ],
          'date' : ['20120321','20120201','20120123','20111218','20111209','20111127','20111118','20111012','20111001','20110921'],
          'invoice_amt' : [654.76, 234.45, 99.45, 767.63, 124.76, 346.87, 652.65, 765.21, 275.76, 532.21 ],
          'no_days_since_last_purchase' : ['49', '54', '66', '21', '69', '46', '58', 'NaN', 'NaN', 'NaN']}

data_df = pd.DataFrame(my_data).sort_values(by='date', ascending=True)

#convert date str to date type
data_df['date'] = pd.to_datetime(data_df['date'].astype(str),format='%Y%m%d')

#group dataframe by customer_id  
grouped_data = data_df.groupby(['customer_id'])    

#for each row in each grouped dataframe calculate the difference in days between current and previous
#if there is no previous then use 2000-01-01 then convert to integer
for customer_id, group in grouped_data:
    group['days_since'] = (group['date'] - group['date'].shift().fillna(pd.datetime(2000,1,1))).astype('timedelta64[D]')
    print(group)

OUTPUT:

  customer_id       date  invoice_amt no_days_since_last_purchase  days_since
8        101A 2011-10-01       275.76                         NaN        4291
4        101A 2011-12-09       124.76                          69          69
1        101A 2012-02-01       234.45                          54          54
0        101A 2012-03-21       654.76                          49          49
  customer_id       date  invoice_amt no_days_since_last_purchase  days_since
9        102A 2011-09-21       532.21                         NaN        4281
6        102A 2011-11-18       652.65                          58          58
2        102A 2012-01-23        99.45                          66          66
  customer_id       date  invoice_amt no_days_since_last_purchase  days_since
7        104B 2011-10-12       765.21                         NaN        4302
5        104B 2011-11-27       346.87                          46          46
3        104B 2011-12-18       767.63                          21          21

Oh, and I do get the SettingWithCopyWarning: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead".

Any thoughts on what I should do to avoid this warning would also be appreciated.

vsquared
  • possible duplicate of [Setting values on a copy of a slice from a DataFrame](http://stackoverflow.com/questions/31468176/setting-values-on-a-copy-of-a-slice-from-a-dataframe) – firelynx Jul 22 '15 at 07:51

2 Answers


Use transform to produce a Series with its index aligned to your original df; you can then assign it as a new column. Additionally, you can't use astype to cast datetime64[ns] to timedelta[D], so you have the additional step of calling to_timedelta:

In [193]:
data_df['days_since'] = data_df.groupby(['customer_id'])['date'].transform(lambda x: x - x.shift().fillna(pd.datetime(2000,1,1)))
data_df['days_since'] = pd.to_timedelta(data_df['days_since'])
data_df

Out[193]:
  customer_id       date  invoice_amt no_days_since_last_purchase  days_since
9        102A 2011-09-21       532.21                         NaN   4281 days
8        101A 2011-10-01       275.76                         NaN   4291 days
7        104B 2011-10-12       765.21                         NaN   4302 days
6        102A 2011-11-18       652.65                          58     58 days
5        104B 2011-11-27       346.87                          46     46 days
4        101A 2011-12-09       124.76                          69     69 days
3        104B 2011-12-18       767.63                          21     21 days
2        102A 2012-01-23        99.45                          66     66 days
1        101A 2012-02-01       234.45                          54     54 days
0        101A 2012-03-21       654.76                          49     49 days

EDIT

Actually, you can call to_timedelta directly on the returned Series, like so:

data_df['days_since'] = pd.to_timedelta(data_df.groupby(['customer_id'])['date'].transform(lambda x: x - x.shift().fillna(pd.datetime(2000,1,1))))
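
For reference, on a more recent pandas you can skip the 2000-01-01 sentinel entirely and get plain integer days, with NaN for each customer's first purchase as in the target frame, by using GroupBy.diff. A minimal sketch, assuming data_df is sorted by date ascending as in the question:

# sketch: diff gives the gap to the previous purchase within each customer;
# .dt.days converts to whole days and leaves NaN for each first purchase
data_df['days_since'] = data_df.groupby('customer_id')['date'].diff().dt.days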
EdChum
df_container = []
for customer_id, group in grouped_data:
    group['days_since'] = (group['date'] - group['date'].shift().fillna(pd.datetime(2000,1,1))).astype('timedelta64[D]')
    df_container.append(group)

data_df = pd.concat(df_container)

Maybe this is what you want?

  customer_id       date  invoice_amt no_days_since_last_purchase  days_since
8        101A 2011-10-01       275.76                         NaN        4291
4        101A 2011-12-09       124.76                          69          69
1        101A 2012-02-01       234.45                          54          54
0        101A 2012-03-21       654.76                          49          49
9        102A 2011-09-21       532.21                         NaN        4281
6        102A 2011-11-18       652.65                          58          58
2        102A 2012-01-23        99.45                          66          66
7        104B 2011-10-12       765.21                         NaN        4302
5        104B 2011-11-27       346.87                          46          46
3        104B 2011-12-18       767.63                          21          21
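
Note that assigning into group still writes to a slice produced by the groupby, so this loop can raise the same SettingWithCopyWarning mentioned in the question. A minimal sketch of the same loop that copies each group first and swaps the timedelta astype for .dt.days, which leaves each customer's first purchase as NaN instead of the 2000-01-01 sentinel:

df_container = []
for customer_id, group in grouped_data:
    g = group.copy()  # work on a copy so we never write into the groupby slice
    # days since the previous purchase; NaN for each customer's first purchase
    g['days_since'] = (g['date'] - g['date'].shift()).dt.days
    df_container.append(g)

data_df = pd.concat(df_container)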
su79eu7k