
So, I have a DataFrame that represents purchases with 4 columns:

  1. date (date of purchase, in the format %Y-%m-%d)
  2. customer_ID (string column)
  3. claim (0/1 column: 1 means the customer complained about the purchase, 0 means they didn't)
  4. claim_value (for claim = 1, how much the claim cost the company; for claim = 0 it's NaN)

I need to build 3 new columns:

  1. past_purchases (how many purchases the specific customer had before this purchase)
  2. past_claims (how many claims the specific customer had before this purchase)
  3. past_claims_value (how much did the customer's past claims cost)
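
For reference, a small made-up frame with this layout (the values below are purely illustrative, not my real data) could be built like this:

import pandas as pd
import numpy as np

# Purely illustrative values matching the column layout described above
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-05', '2020-01-05', '2020-02-01']),
    'customer_ID': ['A', 'A', 'A', 'B'],
    'claim': [1, 0, 1, 0],
    'claim_value': [100.0, np.nan, 50.0, np.nan],
})
# For the third row (customer A, 2020-01-05): past_purchases = 1, past_claims = 1,
# past_claims_value = 100.0, because the other 2020-01-05 purchase must not count.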

This has been my approach until now:

past_purchases = []
past_claims = []
past_claims_value = []

# For each row, collect the same customer's strictly earlier purchases
for i in range(len(df)):
    date = df['date'][i]
    customer_ID = df['customer_ID'][i]
    df_temp = df[(df['date'] < date) & (df['customer_ID'] == customer_ID)]
    past_purchases.append(len(df_temp))
    past_claims.append(df_temp['claim'].sum())
    past_claims_value.append(df_temp['claim_value'].sum())  # sum only this customer's past claim values

df['past_purchases'] = past_purchases
df['past_claims'] = past_claims
df['past_claims_value'] = past_claims_value

The code works fine, but it's too slow. Can anyone make it work faster? Thanks!

PS: It's important to check that the date is strictly older; if the customer had 2 purchases on the same date, they shouldn't count for each other.

PPS: I'm willing to use libraries for parallel processing like multiprocessing, concurrent.futures, joblib or dask, but I've never used them in a similar way before.
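
For illustration, a minimal joblib sketch of the same per-row lookup might look like this (the helper name past_stats and n_jobs=-1 are arbitrary choices; shipping df to the workers adds overhead, and the filtering itself stays quadratic, it is only spread over cores):

from joblib import Parallel, delayed
import pandas as pd

def past_stats(frame, i):
    # Same filter as the loop above: strictly earlier purchases by the same customer
    date = frame['date'].iat[i]
    cust = frame['customer_ID'].iat[i]
    prev = frame[(frame['date'] < date) & (frame['customer_ID'] == cust)]
    return len(prev), prev['claim'].sum(), prev['claim_value'].sum()

results = Parallel(n_jobs=-1)(delayed(past_stats)(df, i) for i in range(len(df)))
df['past_purchases'], df['past_claims'], df['past_claims_value'] = zip(*results)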

Expected outcome:

(The expected output was shared as a screenshot in the original post.)

  • Hi Adrian, kindly share a sample dataset with your expected output – sammywemmy Apr 12 '20 at 07:48
  • I worked a sample in Excel, thanks for the recommendation – Adrian Torrejón Apr 12 '20 at 08:13
  • Hi Adrian, kindly have a look at this [guide](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). You should share data, not pics, and also include your expected output. Visuals help a lot and allow more responders to attempt your question – sammywemmy Apr 12 '20 at 08:28

1 Answer


Maybe you can try using a cumsum over customers, if the dates are sorted in ascending order:

df.sort_values('date', inplace=True)

# Shift within each customer so a row never sees its own claim in the running totals
new_temp_columns = ['claim_s', 'claim_value_s']
df['claim_s'] = df.groupby('customer_ID')['claim'].shift().fillna(0)
df['claim_value_s'] = df.groupby('customer_ID')['claim_value'].shift().fillna(0)

# Cumulative totals of past claims and past claim values per customer
df['past_claims'] = df.groupby('customer_ID')['claim_s'].transform(pd.Series.cumsum)
df['past_claims_value'] = df.groupby('customer_ID')['claim_value_s'].transform(pd.Series.cumsum)

# Set the min value for the groups: same-day purchases must not count for each other
dfc = df.groupby(['customer_ID', 'date'])[['past_claims', 'past_claims_value']]
df[['past_claims', 'past_claims_value']] = dfc.transform('min')

# Remove temp columns
df = df.loc[:, ~df.columns.isin(new_temp_columns)]

Again, this will only work if the dates are sorted.
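
The question also asks for past_purchases; under the same sorted-dates assumption, a cumulative count can be added in the same spirit (this is a sketch on top of the answer's approach, not part of it; the per-(customer_ID, date) minimum again keeps same-day purchases from counting for each other):

# Number of earlier rows for the same customer (df is already sorted by date above)
df['past_purchases'] = df.groupby('customer_ID').cumcount()

# Same-day purchases must not count for each other, so take the per-day minimum
df['past_purchases'] = df.groupby(['customer_ID', 'date'])['past_purchases'].transform('min')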

jcaliz
  • Hi jcaliz, thanks for the fast answer. I think this doesn't help for the case when a customer purchases twice on the same day (those purchases shouldn't count for each other as past purchases or past claims) – Adrian Torrejón Apr 12 '20 at 08:01
  • You can set the values of the group (same `date` and same `user`) so the `cumsum` takes the group's min, but I think that is overcomplicating the procedure. Let me know if this works; I have added an edit. – jcaliz Apr 12 '20 at 20:07