
Background

I am using the Cox proportional hazards model to investigate the effects of several time-varying covariates on tree mortality, which is recorded annually. The time-varying covariates are annual climate variables, meaning they take the same value for every subject at a given time t but differ across times within each subject. I want to know how drought contributes to tree mortality hazard, but drought values are identical for all subjects at time t when the data are arranged in counting-process format (i.e. each row is one year, and subjects have multiple rows of data until the final row indicates the year of death), as in the data setup sketched below.
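
For concreteness, here is a toy version of that layout (the column names mirror those used in the model calls below; the values are invented purely for illustration):

library(survival)

# toy counting-process data: one row per tree (id) per year;
# x1 and x2 are annual climate values, so they are identical for all trees
# within a given (tstart, tstop] interval
datt <- data.frame(
  id     = c(1,1,1,1, 2,2,2,2, 3,3,3),
  tstart = c(0,1,2,3, 0,1,2,3, 0,1,2),
  tstop  = c(1,2,3,4, 1,2,3,4, 1,2,3),
  x1     = c(0.2,0.5,1.3,0.8, 0.2,0.5,1.3,0.8, 0.2,0.5,1.3),  # e.g. annual drought index
  x2     = c(10,12,9,11, 10,12,9,11, 10,12,9),                # another annual climate variable
  status = c(0,0,0,1, 0,0,0,0, 0,0,1)   # tree 1 dies in year 4, tree 3 in year 3
)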

As per the Surv() documentation, both start and stop times must be supplied for counting-process style data, even though my time intervals are each only one time step long (1:2, 2:3, 3:4, and so on).

fit1 <- coxph(Surv(tstart,tstop,status) ~ x1 + x2, data = datt)

This creates the problem discussed here, with error message: 'X matrix deemed to be singular'. At this link, Dr. Therneau explains that "The Cox model compares the values of the covariates of each subject who died to the values of those who did not die, using the current covariate values AT THAT TIME. Since the value of your "t" is always a constant within the set, the variable contains no information for discriminating the events from the non-events. Zero information --> a coefficient of NA. "

Getting to the real question here

I can get around this singularity problem through a coding trick that treats each row as right-censored at tstop rather than as a (tstart, tstop] counting-process interval, which I feel is somewhat defensible given that each of my intervals spans only one time step.

Because the data are still in counting-process style, however, each subject has multiple rows of data; I account for this by adding a cluster(id) term to the formula (though the cluster term is not needed to 'solve' the singularity problem).

fit2 <- coxph(Surv(tstop,status) ~ x1 + x2 + cluster(id), data = datt)

In other words, the singularity flagged by the error message goes away simply by treating time as points rather than intervals. It therefore appears that Surv() looks at covariate values 'at that time' when time is an interval, but not when time is a point.

While this approach produces intuitive results (e.g., a hazard ratio greater than 1 for the drought variable), the proportional hazards assumption is violated, and I am still left with the feeling that I haven't 'tricked' Surv() at all but am instead creating spurious (albeit very intuitive) results.
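
(For reference, the proportionality check can be done with cox.zph() from the survival package; a minimal sketch, assuming fit2 from above:)

# Schoenfeld-residual test of the proportional hazards assumption for fit2;
# small p-values suggest non-proportionality for that covariate
zp <- cox.zph(fit2)
zp
plot(zp)   # scaled residuals vs. time, one panel per covariate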

My questions follow

1) How is Surv() handling these two scenarios such that the results differ? My impression was based on Dr. Therneau's vignette statement that "the likelihood equations at any time point use only one copy of any subject, the program picks out the correct row of data at each time", as described here, but if that is the case, shouldn't I be getting the same error for both fits?

2) Is treating my data as right-censored and accounting for clustering of subjects an acceptable workaround for this problem, or is the code doing something unintended?

3) If my thinking about this is entirely wrong, is there a better way to include these annual climate variables so I can address my research question?

4) If my thinking about this is not entirely wrong and I go on with fit2 as described, would I need to add tt() terms for all variable interactions that include my non-proportional-hazards-inducing variable, drought?

fit3 <- coxph(Surv(tstop, status) ~ x1 + x2 + tt(x2) + x1*x2 + tt(x1*x2) + cluster(id),
              data = datt, tt = function(x, t, ...) x * log(t))

Thank you so much for your time, and forgive me if this question is better suited to Cross Validated (this is my first question post, after all).

  • I don't know the answer, but one way to cross-check your results would be to simulate data with known covariate effects and see if the results of your suggested analytic method are correct ... – Ben Bolker Apr 17 '18 at 23:40
  • Thank you for your response, Ben. You make a good point - do you have any tips (or good references / vignettes) for how to simulate data with known covariate effects? I've never done that before. – Sara Germain Apr 17 '18 at 23:49
  • I think I would actually try to do this by binomial GLM with cloglog link. I believe that for interval-censored models of this type, a binomial/cloglog GLM with fixed effects of time is actually equivalent to Cox PH (can't find reference though ...) – Ben Bolker Apr 18 '18 at 23:03
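
(Following up on the comments, here is a rough, purely illustrative sketch of both suggestions: simulating data with a known drought effect to see whether the fit2-style model recovers it, and the discrete-time binomial/cloglog comparison. All names and values here are hypothetical.)

library(survival)

# simulate annual tree deaths with a known drought effect (true log hazard ratio = 0.5)
set.seed(1)
n_trees <- 500
n_years <- 20
drought <- rnorm(n_years)   # one value per year, shared by all trees
beta    <- 0.5
sim <- do.call(rbind, lapply(seq_len(n_trees), function(i) {
  rows <- NULL
  for (yr in seq_len(n_years)) {
    p_death <- 1 - exp(-exp(-3 + beta * drought[yr]))   # discrete-time (cloglog) hazard
    died    <- rbinom(1, 1, p_death)
    rows    <- rbind(rows, data.frame(id = i, tstart = yr - 1, tstop = yr,
                                      x1 = drought[yr], status = died))
    if (died == 1) break   # stop adding rows once the tree has died
  }
  rows
}))

# fit2-style Cox model on the simulated data: does it recover beta?
fit_sim <- coxph(Surv(tstop, status) ~ x1 + cluster(id), data = sim)
coef(fit_sim)

# discrete-time alternative suggested in the comments:
# binomial GLM with cloglog link and year entering as a fixed effect
fit_cloglog <- glm(status ~ x1 + factor(tstop),
                   family = binomial(link = "cloglog"), data = sim)
coef(fit_cloglog)["x1"]   # compare with beta and with coef(fit_sim)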

0 Answers