
I am trying to predict the survival score and LTV for contractual, discrete policies (in insurance) in Python. I browsed a number of sites, but I could only find examples for the non-contractual setting (in retail).
I have used the code below:

from lifelines import CoxPHFitter
from sklearn.model_selection import train_test_split

# After all feature selection and EDA
cph_train, cph_test = train_test_split(features, test_size=0.2)

cph = CoxPHFitter()
cph.fit(cph_train, duration_col='TIME', event_col='EVENT')
cph.print_summary()

Here TIME is the number of days between the policy start date and the current date for ACTIVE customers, and between the policy start date and the surrender date for non-ACTIVE customers.
EVENT is the indicator for whether the customer is ACTIVE or not.
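
For reference, a minimal sketch of how such columns could be derived with pandas. The table, the column names `start_date` and `surrender_date`, and the cut-off date are hypothetical and only for illustration. It also assumes the usual lifelines convention that the event column is 1 when the event (here, surrender) was observed and 0 when the subject is censored (still ACTIVE).

```python
import pandas as pd

# Hypothetical policy table; column names and dates are illustrative assumptions.
policies = pd.DataFrame({
    "start_date": pd.to_datetime(["2018-01-01", "2018-06-01", "2019-01-01"]),
    "surrender_date": pd.to_datetime(["2018-12-31", None, None]),
})
as_of = pd.Timestamp("2019-06-01")  # assumed observation cut-off ("current date")

# End of observation: the surrender date if known, otherwise the cut-off date.
end_date = policies["surrender_date"].fillna(as_of)
policies["TIME"] = (end_date - policies["start_date"]).dt.days

# Assumed convention: 1 = surrender observed, 0 = censored (still ACTIVE).
policies["EVENT"] = policies["surrender_date"].notna().astype(int)
```

With this coding, the ACTIVE customers are the rows where EVENT is 0.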

After fitting the model I got a concordance of 0.7 (which I feel is OK).
From here, how do I proceed to get the survival score for the ACTIVE customers and their lifetime value (CLTV)? Basically, I need to predict which customers are valuable, i.e. which ones will stay with the company for a long time.

I have added some code after going through some posts and the suggestions by Cam.

from lifelines.utils import qth_survival_times

censored_subjects = features.loc[features['EVENT'] == 1]  # selecting only the ACTIVE ones

unconditioned_sf = cph.predict_survival_function(censored_subjects)

# Condition each curve on survival up to the customer's current TIME
conditioned_sf = unconditioned_sf.apply(lambda c: (c / c.loc[features.loc[c.name, 'TIME']]).clip(upper=1))

predictions_75 = qth_survival_times(.75, conditioned_sf)
predictions_50 = qth_survival_times(.50, conditioned_sf)

values = predictions_75.T.join(data[['PREAMT', 'TIME']])
values50 = predictions_50.T.join(data[['PREAMT', 'TIME']])
values['RemainingValue'] = values['PREAMT'] * (values[0.75] - values['TIME'])

So what does the output denote? The result has the following columns:
0.5, PREAMT, TIME --- Does the number in column 0.5 denote the duration at which there is a 50% chance of the policy being closed?
0.75, PREAMT, TIME --- Similarly, does 0.75 denote the duration at which there is a 75% chance of the policy being closed?
RemainingValue --- Is it the remaining amount to be paid?

And what is the next step after this?

1 Answer


Where TIME - is number of days between the policy taken date and current date for ACTIVE customers and between policy taken date and surrendered date for nonACTIVE customers. EVENT - is the indicator for whether the customer is ACTIVE or not ACTIVE.

Makes sense to me.

After fitting the model I got concordance of 0.7(which I feel is OK).

That's an acceptable score for survival models. But also try AFT models, as these might perform better (also try modelling all the parameters).


So what you need to do next is to predict future lifetime of customers given they have survived t periods. There are some docs on exactly this application. Note the same code applies to AFT models too.
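
To make the conditioning step concrete, here is a small sketch with a made-up survival curve (the numbers are illustrative, not model output). It divides the curve by its value at the customer's current tenure `t0`, which gives the probability of surviving to time t given survival up to `t0`, and then reads off the first time the conditioned curve falls to 0.5 or below.

```python
import pandas as pd

# Toy unconditional survival curve S(t) for one customer, indexed by time.
sf = pd.Series([1.0, 0.9, 0.8, 0.6, 0.4, 0.2], index=[0, 1, 2, 3, 4, 5])
t0 = 2  # the customer is known to have survived 2 periods

# S(t | T > t0) = S(t) / S(t0); values for times before t0 are clipped to 1.
conditioned = (sf / sf.loc[t0]).clip(upper=1)

# First time at which the conditioned curve is at or below 50%.
median_time = conditioned[conditioned <= 0.5].index[0]
remaining = median_time - t0  # expected-median remaining lifetime in periods
```

As far as I understand `qth_survival_times`, it reads a curve in the same way: the value reported for q = 0.75 is the time by which the chance of still being active has dropped to 75% (so a 25% chance of having closed), and it is measured from the policy start, not from today.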

You can choose to predict the median, or survival curve. If your goal is CLV, I think predicting the survival curve is more appropriate, as you can model varying policy rates (sorry I don't know the correct terminology). For example, using the code in the docs:

import numpy as np

times = np.arange(1000)  # predict far out, since we don't want to truncate the survival curve prematurely
unconditioned_sf = cph.predict_survival_function(censored_subjects, times=times)

# df and 'T' follow the docs; in the question these are features and 'TIME'
conditioned_sf = unconditioned_sf.apply(lambda c: (c / c.loc[df.loc[c.name, 'T']]).clip(upper=1))

# simple case: users pay $30 a month (and the units of your survival function are "months")
CLV = (30 * conditioned_sf).sum(0)

# more complicated: they each have a different "rate"
CLV = conditioned_sf.sum(0) * rate_by_user

# and so on...
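
A tiny worked example of the per-user rate case, using made-up conditional survival curves and hypothetical rates (none of these numbers come from a fitted model). Summing a curve over periods gives the expected number of periods the customer stays, and multiplying by the rate turns that into a value.

```python
import pandas as pd

# Toy conditional survival curves: rows are periods, columns are customers.
conditioned_sf = pd.DataFrame(
    {"a": [1.0, 0.5, 0.25], "b": [1.0, 0.8, 0.5]},
    index=[0, 1, 2],
)
rate_by_user = pd.Series({"a": 100.0, "b": 50.0})  # hypothetical premium per period

expected_periods = conditioned_sf.sum(axis=0)  # expected number of periods per customer
clv = expected_periods * rate_by_user          # aligned on customer id
```

Note that when the curves start at time 0 and are clipped to 1 before the current tenure, the sum also counts the periods already survived, so the result is closer to a total lifetime value. If only the future value is wanted, one option is to sum only the rows after each customer's current tenure.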
Cam.Davidson.Pilon
  • Thanks for sharing your thoughts on this. I have added some code after going through some posts and your suggestions; the modifications are in the question itself. I would like to know what the output means (as described in the question). And secondly, what is the next step after this? Or is that already the LTV? – ggg_datascience Jun 06 '19 at 15:22