Lifetimes package gives inconsistent results

Question

I am using Lifetimes to compute the CLV of some of my customers. I have transactional data and want to use summary_data_from_transaction_data (in lifetimes.utils) to compute the recency, the frequency and the observation time T of each customer. Unfortunately, the function does not seem to compute the frequency correctly.
Here is the code I use to test my dataset:

import pandas as pd
from lifetimes.utils import summary_data_from_transaction_data

df_test = pd.read_csv('test_clv.csv', sep=',')
RFT_from_library = summary_data_from_transaction_data(df_test,
                                                      'Customer',
                                                      'Transaction date',
                                                      observation_period_end='2020-02-12',
                                                      freq='D')

According to the code, the result is:

          frequency  recency      T
Customer
1158624        18.0    389.0  401.0
1171970        67.0    396.0  406.0
1188564        12.0    105.0  401.0

The problem is that customers 1171970 and 1188564 made 69 and 14 transactions respectively, so their frequencies should have been 68 and 13. Printing the number of rows per customer confirms this:

print(df_test.groupby('Customer').size())

Customer
1158624    19
1171970    69
1188564    14

I also tried running the underlying code of summary_data_from_transaction_data directly, like this:

import numpy as np

# 'Transaction date' must already be parsed as datetime, e.g. with pd.to_datetime
RFT_native = df_test.groupby('Customer', sort=False)['Transaction date'].agg(["min", "max", "count"])
observation_period_end = pd.to_datetime('2020-02-12', format=None).to_period('D').to_timestamp()
# subtract 1 from count, as we ignore their first order.
RFT_native["frequency"] = RFT_native["count"] - 1
# the trailing "/ 1" stands for freq_multiplier, left at 1
RFT_native["T"] = (observation_period_end - RFT_native["min"]) / np.timedelta64(1, 'D') / 1
RFT_native["recency"] = (RFT_native["max"] - RFT_native["min"]) / np.timedelta64(1, 'D') / 1

As you can see, this gives the frequencies I expect:

                         min                 max  count  frequency           T     recency
Customer
1171970  2019-01-02 15:45:39 2020-02-02 13:40:18     69         68  405.343299  395.912951
1188564  2019-01-07 18:10:55 2019-04-22 14:27:08     14         13  400.242419  104.844595
1158624  2019-01-07 10:52:33 2020-01-31 13:50:36     19         18  400.546840  389.123646

Of course, my real dataset is much bigger, and even a slight difference in frequency and/or recency changes the output of the BGF model considerably.

What am I missing? Is there something that I should consider when using the method?

score 2 · Answer 1 · edited May 26 '20 at 11:53

I might be a bit late to answer your query, but here it goes.

The documentation for the Lifetimes package defines frequency as:

frequency represents the number of repeat purchases the customer has made. This means that it’s one less than the total number of purchases. This is actually slightly wrong. It’s the count of time periods the customer had a purchase in. So if using days as units, then it’s the count of days the customer had a purchase on.

So it is the number of time periods in which the customer made a repeat purchase, not the number of individual repeat purchases. A quick scan of your sample dataset confirms that 1188564 and 1171970 each made 2 purchases on a single day, 13 Jan 2019 and 15 Jun 2019 respectively. Those 2 transactions count as 1 when calculating frequency, so the frequency returned by summary_data_from_transaction_data is 2 less than the customer's total number of transactions, i.e. 1 less than your manually computed frequency (12 instead of 13, and 67 instead of 68).
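To check this on your own data, here is a minimal pandas-only sketch. It does not call Lifetimes; it only mimics the per-day counting described above. The toy rows are made up for illustration and are not taken from your file.

```python
import pandas as pd

# Toy transactions (hypothetical data, not the asker's file).
df = pd.DataFrame({
    "Customer": [1, 1, 1, 2, 2],
    "Transaction date": pd.to_datetime([
        "2019-01-13 09:00:00",
        "2019-01-13 17:30:00",  # same day as the row above
        "2019-02-01 12:00:00",
        "2019-01-07 10:00:00",
        "2019-03-01 10:00:00",
    ]),
})

# Truncate every timestamp to midnight so that purchases made on the
# same calendar day collapse into one period (the freq='D' case).
df["day"] = df["Transaction date"].dt.normalize()

per_customer = df.groupby("Customer").agg(
    transactions=("Transaction date", "count"),
    active_days=("day", "nunique"),
)
# Repeat purchases counted per transaction vs. per day.
per_customer["naive_frequency"] = per_customer["transactions"] - 1
per_customer["period_frequency"] = per_customer["active_days"] - 1

# Days on which a customer bought more than once.
day_counts = df.groupby(["Customer", "day"]).size()
same_day = day_counts[day_counts > 1]

print(per_customer)
print(same_day)
```

Any customer that shows up in same_day will have a period-based frequency lower than count - 1.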

score 0 · Answer 2 · answered May 07 '22 at 05:54

According to the documentation, you need to set:

include_first_transaction = True

include_first_transaction (bool, optional) – Default: False. By default the first transaction is not included while calculating frequency and monetary_value. Can be set to True to include it. Should be False if you are going to use this data with any fitters in lifetimes package.
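To see what the flag changes, here is a rough pandas-only sketch. The helper period_frequency is hypothetical and is not part of Lifetimes; it assumes frequency is counted per distinct period, as the documentation quoted in the other answer describes, and that the flag only decides whether the first period is subtracted.

```python
import pandas as pd

def period_frequency(transactions, customer_col, datetime_col,
                     freq="D", include_first_transaction=False):
    """Hypothetical helper: distinct purchase periods per customer.

    Subtracts the first period unless include_first_transaction is True.
    """
    periods = pd.to_datetime(transactions[datetime_col]).dt.to_period(freq)
    active = periods.groupby(transactions[customer_col]).nunique()
    return active if include_first_transaction else active - 1

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "Customer": ["a", "a", "a", "b"],
    "Transaction date": [
        "2019-01-13 09:00",
        "2019-01-13 17:30",  # same day as the row above
        "2019-02-01 12:00",
        "2019-01-07 10:00",
    ],
})

without_first = period_frequency(toy, "Customer", "Transaction date")
with_first = period_frequency(toy, "Customer", "Transaction date",
                              include_first_transaction=True)
print(without_first)
print(with_first)
```

Under these assumptions, the flag raises every customer's frequency by one. Customers with no same-day purchases then get a frequency equal to their total transaction count, not count - 1.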
