I am using Lifetimes
to compute CLV of some customers of mine.
I have transactional data and, by means of summary_data_from_transaction_data
(the implementation can be found here) I would like to compute
the recency, the frequency and the time interval T of each customer.
Unfortunately, it seems that the method does not compute correctly the frequency.
Here is the code for testing my dataset:
df_test = pd.read_csv('test_clv.csv', sep=',')
RFT_from_libray = summary_data_from_transaction_data(df_test,
'Customer',
'Transaction date',
observation_period_end='2020-02-12',
freq='D')
According to the code, the result is:
frequency recency T
Customer
1158624 18.0 389.0 401.0
1171970 67.0 396.0 406.0
1188564 12.0 105.0 401.0
The problem is that customer 1188564
and customer 1171970
did respectively 69 and 14 transaction, thus the frequency should have been 68 and 13.
Printing the size of each customer confirms that:
print(df_test.groupby('Customer').size())
Customer
1158624 19
1171970 69
1188564 14
I did try to use natively the underlying code in the summary_data_from_transaction_data
like this:
RFT_native = df_test.groupby('Customer', sort=False)['Transaction date'].agg(["min", "max", "count"])
observation_period_end = (
pd.to_datetime('2020-02-12', format=None).to_period('D').to_timestamp()
)
# subtract 1 from count, as we ignore their first order.
RFT_native ["frequency"] = RFT_native ["count"] - 1
RFT_native ["T"] = (observation_period_end - RFT_native ["min"]) / np.timedelta64(1, 'D') / 1
RFT_native ["recency"] = (RFT_native ["max"] - RFT_native ["min"]) / np.timedelta64(1, 'D') / 1
As you can see, the result is indeed correct.
min max count frequency T recency
Customer
1171970 2019-01-02 15:45:39 2020-02-02 13:40:18 69 68 405.343299 395.912951
1188564 2019-01-07 18:10:55 2019-04-22 14:27:08 14 13 400.242419 104.844595
1158624 2019-01-07 10:52:33 2020-01-31 13:50:36 19 18 400.546840 389.123646
Of course my dataset is much bigger, and a slight difference in my frequency and/or recency alters a lot the computation of the BGF model.
What am I missing? Is there something that I should consider when using the method?