I am using the lifelines library to fit a Cox proportional hazards (PH) model. The regression has many categorical features, which I one-hot encode, dropping one column per feature to avoid multicollinearity (the dummy variable trap). I am not attaching the code, since the example would be similar to the one given in the documentation here.

Running cph.check_assumptions(data) reports that every dummy variable violates the proportional hazards assumption:

Variable 'dummy_a' failed the non-proportional test: p-value is 0.0063.
Advice: with so few unique values (only 2), you can try `strata=['dummy_a']` in the call in `.fit`. See documentation in link [A] and [B] below.

How should I interpret this advice when a single categorical feature is represented by several dummy variables? Should I add all of them to strata?

I would appreciate any comments :)

abu

1 Answer

@abu, your question exposes a clear gap in the documentation: what to do when dummy variables fail the proportional hazards test. In this case, I suggest not dummy-encoding the variable and instead passing the original column as a stratification variable, e.g. fit(..., strata=['dummy'])

Cam.Davidson.Pilon
  • Thanks for your reply! So as far as I understand, I should use the categorical variable in the strata. Could you please briefly explain the reasoning behind it? What is the advantage of doing so over using dummies? – abu Mar 06 '19 at 15:12
  • Because it also makes it difficult to interpret the results when I include a few categorical features (each with several levels) in strata. – abu Mar 06 '19 at 15:20
  • The same information is in the categorical variable vs the dummied variable, so I guess you can use either. The former is slightly simpler to implement, I feel. – Cam.Davidson.Pilon Mar 06 '19 at 17:18
  • What I meant is why include it in the strata as opposed to including it as a feature and having coefficients for each level. – abu Mar 07 '19 at 09:47
  • Well, the model with the features violates the PH assumption, so in reality the hazard ratio isn't constant. A single (constant) coefficient therefore can't capture all the information about a time-varying hazard ratio. If you are only interested in prediction, then it doesn't really matter. – Cam.Davidson.Pilon Mar 07 '19 at 14:51
  • I am interested in both survival prediction as well as which features turn out to impact it significantly. I will follow your advice, thanks. – abu Mar 07 '19 at 15:09
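
The trade-off discussed in these comments can be written out in standard notation. The symbols below are not from the thread and are assumed for illustration: $x$ is the vector of remaining covariates, $g$ the level of the categorical feature, $\gamma_g$ the dummy coefficient for level $g$ (zero for the reference level), and $h_0$ the baseline hazard.

```latex
% Dummy-coded model: one shared baseline, one coefficient per level.
h(t \mid x, g) = h_0(t)\,\exp\!\left(\beta^\top x + \gamma_g\right),
\qquad
\frac{h(t \mid x, g)}{h(t \mid x, g')} = \exp\!\left(\gamma_g - \gamma_{g'}\right)
\quad \text{(constant in } t\text{)}

% Stratified model: one baseline per level, no coefficient for the feature.
h(t \mid x, g) = h_{0g}(t)\,\exp\!\left(\beta^\top x\right),
\qquad
\frac{h(t \mid x, g)}{h(t \mid x, g')} = \frac{h_{0g}(t)}{h_{0g'}(t)}
\quad \text{(may vary with } t\text{)}
```

The dummy-coded form forces the hazard ratio between two levels to be constant over time, which is what the failed test contradicts. The stratified form drops that restriction but estimates no $\gamma_g$, so the feature's effect is no longer summarized by a single number.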