I would have the following question:
I have about 30 million rows of data. The data has six columns:
DISTINCT_IRECIPIENTID | ORDERNUMBER | ORDERDATE | ORDERDATE_OF_NEXT_ORDER | RETAINED_OR_NOT
"RETAINED_OR_NOT" has three categories:
"Retained for one year" if the difference between ORDERDATE and ORDERDATE_OF_NEXT_ORDER is <=365 days,
Next_purchase_but_not_retained if the difference between ORDERDATE -ORDERDATE_OF_NEXT_ORDER is >365 days
"Only one lifetime purchase" if that person only placed one lifetime order
I now want to calculate the # of years a consumer is retained in a row.
So if a customer placed four orders, and 3 consecutive ones are each paced at a time difference of <=1 year while the difference between the 3rd and the 4th is > 1 year then the value should be counted as 3 for the first three ones and 0 for the last one.
The data is already ordered by DISTINCT_IRECIPIENTID by ORDERDATE DESC
I basically wrote the following code, but the execution is extremely slow. Would anyone have some ideas on how to make the code more efficient?
def find_consecutive_purchases_in_a_year(input):
count = 0
exit_loop = 0
sub_data = prepared_main_data_backup[prepared_main_data_backup['DISTINCT_IRECIPIENTID']
== input]
for index, row in sub_data.iterrows():
if exit_loop == 1:
return count
if exit_loop == 0:
if row['RETAINED_OR_NOT'] == 'retained_for_one_year':
count += 1
else:
exit_loop = 1
return count
data_test = prepared_main_data_backup
data_test['retain_counter'] = data_test['DISTINCT_IRECIPIENTID'].apply(
find_consecutive_purchases_in_a_year)
Please find below some sample data.
DISTINCT_IRECIPIENTID TSORDERDATETIME FIRST_TRANS_DATE ORDER_DATE_AFTER DIFFERENCE_BETWEEN_ORDERS RETAINED_OR_NOT Output
1 2017-04-24-09.33.21.000000 2017-04-24-09.33.21.000000 only one lifetime purchase 0
2 2017-04-24-09.35.16.000000 2017-04-24-09.35.16.000000 only one lifetime purchase 0
3 2017-04-27-14.45.48.000000 2017-04-27-14.45.48.000000 2017-04-29-14.53.46.000000 2 retained_for_one_year 2
3 2017-04-29-14.53.46.000000 2017-04-27-14.45.48.000000 2017-05-10-09.06.25.000000 11 retained_for_one_year 2
3 2017-05-10-09.06.25.000000 2017-04-27-14.45.48.000000 2018-09-22-05.54.07.000000 500 next_purchase_but_not_retained 0
3 2018-09-22-05.54.07.000000 2017-04-27-14.45.48.000000 2020-09-12-19.12.59.000000 721 next_purchase_but_not_retained 0
3 2020-09-12-19.12.59.000000 2017-04-27-14.45.48.000000 2020-09-14-11.49.33.000000 2 retained_for_one_year 2
3 2020-09-14-11.49.33.000000 2017-04-27-14.45.48.000000 2021-06-08-07.18.42.000000 267 retained_for_one_year 2
4 2017-04-24-09.35.27.000000 2017-04-24-09.35.27.000000 2017-04-30-12.00.14.000000 6 retained_for_one_year 1
4 2017-04-30-12.00.14.000000 2017-04-24-09.35.27.000000 2018-06-18-09.15.23.000000 414 next_purchase_but_not_retained 0
Any input is much appreciated!