
I have the following question:

I have about 30 million rows of data. The data has the following columns:

DISTINCT_IRECIPIENTID | ORDERNUMBER | ORDERDATE | ORDERDATE_OF_NEXT_ORDER | RETAINED_OR_NOT

"RETAINED_OR_NOT" has three categories:

  • "retained_for_one_year" if the difference between ORDERDATE and ORDERDATE_OF_NEXT_ORDER is <= 365 days,

  • "next_purchase_but_not_retained" if the difference between ORDERDATE and ORDERDATE_OF_NEXT_ORDER is > 365 days,

  • "only one lifetime purchase" if that person only placed one lifetime order.
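In case it helps reproduce the categories, here is a minimal sketch of the three rules with `np.select` (column names taken from the post; the toy frame mirrors customer 3 from the sample, plus a hypothetical single-order row marked by a missing next-order date):

```python
import numpy as np
import pandas as pd

# toy frame: three orders of customer 3 from the sample, plus one
# hypothetical single-order customer (NaT = no next order)
df = pd.DataFrame({
    'ORDERDATE': pd.to_datetime(
        ['2017-04-27', '2017-04-29', '2017-05-10', '2017-06-01']),
    'ORDERDATE_OF_NEXT_ORDER': pd.to_datetime(
        ['2017-04-29', '2017-05-10', '2018-09-22', None]),
})

gap_days = (df['ORDERDATE_OF_NEXT_ORDER'] - df['ORDERDATE']).dt.days
# conditions are checked in order: missing next order first, then the gap
df['RETAINED_OR_NOT'] = np.select(
    [gap_days.isna(), gap_days <= 365],
    ['only one lifetime purchase', 'retained_for_one_year'],
    default='next_purchase_but_not_retained',
)
print(df['RETAINED_OR_NOT'].tolist())
# → ['retained_for_one_year', 'retained_for_one_year',
#    'next_purchase_but_not_retained', 'only one lifetime purchase']
```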

I now want to calculate, per row, the number of consecutive years a customer is retained.

So if a customer placed four orders, and three consecutive ones are each placed at a time difference of <= 1 year while the difference between the third and the fourth is > 1 year, then the value should be 3 for the first three rows and 0 for the last one.

The data is already ordered by DISTINCT_IRECIPIENTID, then by ORDERDATE DESC.

I basically wrote the following code, but the execution is extremely slow. Would anyone have some ideas on how to make the code more efficient?

def find_consecutive_purchases_in_a_year(recipient_id):
    # filter the (global) full dataframe down to this customer's rows
    sub_data = prepared_main_data_backup[
        prepared_main_data_backup['DISTINCT_IRECIPIENTID'] == recipient_id]

    count = 0
    for _, row in sub_data.iterrows():
        if row['RETAINED_OR_NOT'] == 'retained_for_one_year':
            count += 1
        else:
            break  # stop at the first non-retained order
    return count


data_test = prepared_main_data_backup

data_test['retain_counter'] = data_test['DISTINCT_IRECIPIENTID'].apply(
    find_consecutive_purchases_in_a_year)
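For context on the slowness: the function re-filters the whole dataframe for every row, so runtime grows roughly quadratically. A vectorized groupby/cumsum sketch that computes the run lengths in a single pass is below (column names taken from the post; note that, matching the sample's Output column, it assigns every retained run its length, not only the first run per customer as the function above does):

```python
import pandas as pd

# toy frame shaped like the sample data (customers 1, 3, 4)
df = pd.DataFrame({
    'DISTINCT_IRECIPIENTID': [1, 3, 3, 3, 3, 3, 3, 4, 4],
    'RETAINED_OR_NOT': [
        'only one lifetime purchase',
        'retained_for_one_year', 'retained_for_one_year',
        'next_purchase_but_not_retained', 'next_purchase_but_not_retained',
        'retained_for_one_year', 'retained_for_one_year',
        'retained_for_one_year', 'next_purchase_but_not_retained',
    ],
})

retained = df['RETAINED_OR_NOT'].eq('retained_for_one_year')
# a new run starts whenever the retained flag flips or the customer changes
run_id = (retained.ne(retained.shift())
          | df['DISTINCT_IRECIPIENTID'].ne(
              df['DISTINCT_IRECIPIENTID'].shift())).cumsum()
# each retained row gets the size of its run; everything else gets 0
df['retain_counter'] = retained * retained.groupby(run_id).transform('size')

print(df['retain_counter'].tolist())  # → [0, 2, 2, 0, 0, 2, 2, 1, 0]
```

This assumes the frame is already sorted by customer and order date, as stated in the question.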

Please find below some sample data.


    DISTINCT_IRECIPIENTID   TSORDERDATETIME FIRST_TRANS_DATE    ORDER_DATE_AFTER    DIFFERENCE_BETWEEN_ORDERS   RETAINED_OR_NOT Output
    1   2017-04-24-09.33.21.000000  2017-04-24-09.33.21.000000          only one lifetime purchase  0
    2   2017-04-24-09.35.16.000000  2017-04-24-09.35.16.000000          only one lifetime purchase  0
    3   2017-04-27-14.45.48.000000  2017-04-27-14.45.48.000000  2017-04-29-14.53.46.000000  2   retained_for_one_year   2
    3   2017-04-29-14.53.46.000000  2017-04-27-14.45.48.000000  2017-05-10-09.06.25.000000  11  retained_for_one_year   2
    3   2017-05-10-09.06.25.000000  2017-04-27-14.45.48.000000  2018-09-22-05.54.07.000000  500 next_purchase_but_not_retained  0
    3   2018-09-22-05.54.07.000000  2017-04-27-14.45.48.000000  2020-09-12-19.12.59.000000  721 next_purchase_but_not_retained  0
    3   2020-09-12-19.12.59.000000  2017-04-27-14.45.48.000000  2020-09-14-11.49.33.000000  2   retained_for_one_year   2
    3   2020-09-14-11.49.33.000000  2017-04-27-14.45.48.000000  2021-06-08-07.18.42.000000  267 retained_for_one_year   2
    4   2017-04-24-09.35.27.000000  2017-04-24-09.35.27.000000  2017-04-30-12.00.14.000000  6   retained_for_one_year   1
    4   2017-04-30-12.00.14.000000  2017-04-24-09.35.27.000000  2018-06-18-09.15.23.000000  414 next_purchase_but_not_retained  0
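For anyone copy-pasting the sample: the timestamps use a DB2-style format (`YYYY-MM-DD-HH.MM.SS.ffffff`), which `pd.to_datetime` parses with an explicit format string:

```python
import pandas as pd

# parse one of the sample timestamps with an explicit format
ts = pd.to_datetime('2017-04-24-09.33.21.000000',
                    format='%Y-%m-%d-%H.%M.%S.%f')
print(ts)  # → 2017-04-24 09:33:21
```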




Any input is much appreciated!

Horstus
    Can you please edit your question and put there sample input (small) and expected output? – Andrej Kesely Sep 07 '21 at 16:50
  • Just added a picture, I hope it helps! – Horstus Sep 07 '21 at 17:35
  • @Horstus, a picture would not help others reproduce your question and attempt to solve it. I'll request if you can add a sample of the dataset itself (say 5-10 rows). Simply `print(df.head(10))` and copy the complete output. Paste it in a code block. – Akshay Sehgal Sep 07 '21 at 17:37
  • @Horstus Can you post the data as text (so we can copy-paste it)? You can use `Ctrl`-`K` for formatting. – Andrej Kesely Sep 07 '21 at 17:38
  • Two comments: 1) The function runs on every row in your dataframe. Inside that function, you filter the entire dataframe again. This means that run time will go up as the square of the number of rows, which is inefficient. 2) I believe this is equivalent to a groupby followed by filtering for non 'retained_for_one_year,' followed by idxmax, followed by merging the result back into the dataframe. That should be faster. I'd make a more detailed example, but you [haven't posted your data](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Nick ODell Sep 07 '21 at 17:45
  • @NickODell: Just added some sample data too – Horstus Sep 08 '21 at 07:02
  • Anyone would have some idea? – Horstus Sep 14 '21 at 16:18

0 Answers