4

I'm using survival analysis to evaluate the relative distance (instead of time, as it's usually the case with survival statistics) before a given event happened. As the dataset I'm working with is quite big, you can download the .rds file of my dataset here

When modeling the relative distance using survreg(), I encountered NaN and Inf z and p values (presumably deriving from 0 values of Std Error) in the model summary:

Call:
survreg(formula = Surv(RelDistance, Status) ~ Backshore + LowerBSize + 
    I(LowerBSize^2) + I(LowerBSize^3) + State, data = DataLong, 
    dist = "exponential")
                            Value Std. Error        z         p
(Intercept)                   2.65469   1.16e-01  22.9212 2.85e-116
BackshoreDune                -0.08647   9.21e-02  -0.9387  3.48e-01
BackshoreForest / Tree (>3m) -0.07017   0.00e+00     -Inf  0.00e+00
BackshoreGrass - pasture     -0.79275   1.63e-01  -4.8588  1.18e-06
BackshoreGrass - tussock     -0.14687   1.00e-01  -1.4651  1.43e-01
BackshoreMangrove             0.08207   0.00e+00      Inf  0.00e+00
BackshoreSeawall             -0.47019   1.43e-01  -3.2889  1.01e-03
BackshoreShrub (<3m)         -0.14004   9.45e-02  -1.4815  1.38e-01
BackshoreUrban / Building     0.00000   0.00e+00      NaN       NaN
LowerBSize                   -0.96034   1.96e-02 -49.0700  0.00e+00
I(LowerBSize^2)               0.06403   1.87e-03  34.2782 1.66e-257
I(LowerBSize^3)              -0.00111   3.84e-05 -28.8070 1.75e-182
StateNT                       0.14936   0.00e+00      Inf  0.00e+00
StateQLD                      0.09533   1.02e-01   0.9384  3.48e-01
StateSA                       0.01030   1.06e-01   0.0973  9.22e-01
StateTAS                      0.19096   1.26e-01   1.5171  1.29e-01
StateVIC                     -0.07467   1.26e-01  -0.5917  5.54e-01
StateWA                       0.24800   9.07e-02   2.7335  6.27e-03

Scale fixed at 1 

Exponential distribution
Loglik(model)= -1423.4   Loglik(intercept only)= -3282.8
    Chisq= 3718.86 on 17 degrees of freedom, p= 0 
Number of Newton-Raphson Iterations: 6 
n= 6350 

I thought the Inf and NaN could be caused by data separation, and merged some levels of Backshore together:

levels(DataLong$Backshore)[levels(DataLong$Backshore)%in%c("Grass - 
pasture", "Grass - tussock", "Shrub (<3m)")] <- "Grass - pasture & tussock 
/ Shrub(<3m)"
levels(DataLong$Backshore)[levels(DataLong$Backshore)%in%c("Seawall", 
"Urban / Building")] <- "Anthropogenic"
levels(DataLong$Backshore)[levels(DataLong$Backshore)%in%c("Forest / Tree 
(>3m)", "Mangrove")] <- "Tree(>3m) / Mangrove"

but the issue persists when running the model again (i.e. Backshore Tree(>3m) / Mangrove).

Call:
survreg(formula = Surv(RelDistance, Status) ~ Backshore + LowerBSize + 
    I(LowerBSize^2) + I(LowerBSize^3) + State, data = DataLong, 
    dist = "exponential")
                                              Value Std. Error       z         p
(Intercept)                                      2.6684   1.18e-01  22.551 1.32e-112
BackshoreDune                                   -0.1323   9.43e-02  -1.402  1.61e-01
BackshoreTree(>3m) / Mangrove                   -0.0530   0.00e+00    -Inf  0.00e+00
BackshoreGrass - pasture & tussock / Shrub(<3m) -0.2273   8.95e-02  -2.540  1.11e-02
BackshoreAnthropogenic                          -0.5732   1.38e-01  -4.156  3.24e-05
LowerBSize                                      -0.9568   1.96e-02 -48.920  0.00e+00
I(LowerBSize^2)                                  0.0639   1.87e-03  34.167 7.53e-256
I(LowerBSize^3)                                 -0.0011   3.84e-05 -28.713 2.59e-181
StateNT                                          0.2892   0.00e+00     Inf  0.00e+00
StateQLD                                         0.0715   1.00e-01   0.713  4.76e-01
StateSA                                          0.0507   1.05e-01   0.482  6.30e-01
StateTAS                                         0.1990   1.26e-01   1.581  1.14e-01
StateVIC                                        -0.0604   1.26e-01  -0.479  6.32e-01
StateWA                                          0.2709   9.05e-02   2.994  2.76e-03

Scale fixed at 1 

Exponential distribution
Loglik(model)= -1428.4   Loglik(intercept only)= -3282.8
    Chisq= 3708.81 on 13 degrees of freedom, p= 0 
Number of Newton-Raphson Iterations: 6 
n= 6350 

I've looked for an explanation for this behaviour pretty much everywhere in the survival package documentation and online, but I couldn't find anything that related to this.

Does anyone know what could be the cause of Inf and NaNs in this case?

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • @BenBolker thanks. I previously tested for Cramer's V, but the problem peristed even when excluding either the State or Backshore (which are the two variables that shoewd the largest Cramer's V) from the model. I don't get what you mean when referring to the Australian states and territories, how is it relevant with this issue? – Ari Olivelli Jan 07 '19 at 01:23
  • I was wrong .... – Ben Bolker Jan 07 '19 at 01:24

2 Answers2

3

The covariate LowerBSize perfectly predicts the Status outcome; Status==0 only if LowerBSize==0 and Status==1 only if LowerBSize>0.

table(DataLong$LowerBSize, DataLong$Status)

          0    1   
  0    4996    0
  1.2     0  271
  2.4     0  331
  4.9     0  256
  9.6     0  155
  19.2    0  148
  36.3    0  193

A convenient way to consider LowerBSize in the model is to include the binary variable LowerBSize>0:

survreg(formula = Surv(RelDistance, Status) ~ Backshore +  State + 
        I(LowerBSize>0), data = DataLong,  dist = "exponential")

Coefficients:
                 (Intercept)                BackshoreDune BackshoreForest / Tree (>3m) 
                 22.97248461                  -0.04798348                  -0.27440059 
    BackshoreGrass - pasture     BackshoreGrass - tussock            BackshoreMangrove 
                 -0.33624746                  -0.07545700                   0.12020217 
            BackshoreSeawall         BackshoreShrub (<3m)    BackshoreUrban / Building 
                 -0.01008893                  -0.05115076                   0.29125024 
                     StateNT                     StateQLD                      StateSA 
                  0.15385826                   0.11617931                   0.08405151 
                    StateTAS                     StateVIC                      StateWA 
                  0.14914393                   0.08803225                   0.06395311 
       I(LowerBSize > 0)TRUE 
                -23.75967069 

Scale fixed at 1 

Loglik(model)= -316.5   Loglik(intercept only)= -3282.8
        Chisq= 5932.66 on 15 degrees of freedom, p= <2e-16 
n= 6350
Marco Sandri
  • 23,289
  • 7
  • 54
  • 58
  • I upvoted, but I think this isn't the whole story (see my answer). – Ben Bolker Jan 07 '19 at 01:22
  • Thanks for the input. LowerBSize represents the size of the object found after a certain relative distance, therefore if no object was found (Status = 0), LowerBSize is also 0. I consider LowerBsize to be an important factor to predict RelDistance when Status = 1, so I'm not sure about dropping the variable completely. – Ari Olivelli Jan 07 '19 at 01:48
2

@MarcoSandri is correct that censoring is confounded with LowerBSize, but I'm not sure that's the entire solution. It could explain why the model is so unstable, but that by itself shouldn't (AFAICT) make the model ill-posed. If I replace LowerBSize+ I(LowerBSize^2) + I(LowerBSize^3) with an orthogonal polynomial (poly(LowerBSize,3)) I get more reasonable-looking answers:

ss3 <- survreg(formula = Surv(RelDistance, Status) ~ Backshore +
                   poly(LowerBSize,3) + State, data = DataLong, 
               dist = "exponential")

Call:
survreg(formula = Surv(RelDistance, Status) ~ Backshore + poly(LowerBSize, 
    3) + State, data = DataLong, dist = "exponential")
                                 Value Std. Error      z       p
(Intercept)                   2.18e+00   1.34e-01  16.28 < 2e-16
BackshoreDune                -1.56e-01   1.06e-01  -1.47 0.14257
BackshoreForest / Tree (>3m) -2.24e-01   2.01e-01  -1.11 0.26549
BackshoreGrass - pasture     -8.63e-01   1.74e-01  -4.97 6.7e-07
BackshoreGrass - tussock     -2.14e-01   1.13e-01  -1.89 0.05829
BackshoreMangrove             3.66e-01   4.59e-01   0.80 0.42519
BackshoreSeawall             -5.37e-01   1.53e-01  -3.51 0.00045
BackshoreShrub (<3m)         -2.08e-01   1.08e-01  -1.92 0.05480
BackshoreUrban / Building    -1.17e+00   3.22e-01  -3.64 0.00028
poly(LowerBSize, 3)1         -6.58e+01   1.41e+00 -46.63 < 2e-16
poly(LowerBSize, 3)2          5.09e+01   1.19e+00  42.72 < 2e-16
poly(LowerBSize, 3)3         -4.05e+01   1.41e+00 -28.73 < 2e-16
StateNT                       2.61e-01   1.93e-01   1.35 0.17557
StateQLD                      9.72e-02   1.12e-01   0.87 0.38452
StateSA                      -4.11e-04   1.15e-01   0.00 0.99715
StateTAS                      1.91e-01   1.35e-01   1.42 0.15581
StateVIC                     -9.55e-02   1.35e-01  -0.71 0.47866
StateWA                       2.46e-01   1.01e-01   2.44 0.01463

If I fit exactly the same model but with poly(LowerBSize,3,raw=TRUE) (calling the result ss4, see below) I get your pathologies again. Furthermore, the model with orthogonal polynomials actually fits better (has a higher log-likelihood):

logLik(ss4)
## 'log Lik.' -1423.382 (df=18)
logLik(ss3)
## 'log Lik.' -1417.671 (df=18)

In a perfect mathematical/computational world, this shouldn't be true - it's another indication that there's something unstable about specifying the LowerBSize effects this way. I'm a little surprised this happens - the number of unique values of LowerBSize is small but shouldn't be pathological, and the range of values isn't huge or far from zero ...


I still can't say what's really causing this, but the proximal problem is probably the strong correlation between the linear/quadratic/cubic terms: for something simpler like linear regression a correlation of 0.993 (between quad & cubic terms) doesn't cause severe problems, but the more complicated the numerical problem (e.g. survival analysis vs. linear regression), the more correlation can be an issue ...

X <- model.matrix( ~ Backshore + LowerBSize + 
                       I(LowerBSize^2) + I(LowerBSize^3) + State,
                  data=DataLong)
print(cor(X[,grep("LowerBSize",colnames(X))]),digits=3)
library(corrplot)
png("survcorr.png")
corrplot.mixed(cor(X[,-1]),lower="ellipse",upper="number",
             tl.cex=0.4)
dev.off()

enter image description here

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Thank you Ben for your interesting answer. What about adding `I(LowerBSize>0)` to the model ? (See my edited answer). – Marco Sandri Jan 07 '19 at 01:54
  • 1
    It's not unreasonable, but it potentially leaves out a lot of interesting information - if you look at coefficients in my summary above, the quadratic and cubic terms seem to be pretty important. As I said, I don't think there should be anything fundamentally wrong with having the censoring confounded with a covariate ... In fact, I'd consider incorporating it in the model as an ordinal variable (i.e., `ordered` factor), so that *all* of the info in the covariate can be considered (since the data set is large and there are only 7 unique values ...) – Ben Bolker Jan 07 '19 at 02:00
  • Thank you Ben, this is really interesting. I was originally using an `ordered` factor, but I wanted to simplify the model, thus I introduced a `numeric` variable instead. I still don't get the theory behind this problem, though. Do you know why switching to a `numeric` variable causes it? Eventually, even with an `ordered` factor, all the obervations with `Status == 0` would have `Size == 0`, and those with `Status == 1` would have `Size>0`. – Ari Olivelli Jan 07 '19 at 02:27
  • 1
    I really don't think the censoring-confounding in @MarcoSandri's answer is the issue. It's more to do with correlation of predictor variables (see updates) – Ben Bolker Jan 07 '19 at 02:39
  • When running the model again, I see the problem persists if I include the linear and cubic or the quadratic and cubic terms together, while it seems to work if I include linear and quadratic (regardless of the high correlation). In fact, the correlation between linear and quad is higher than that between linear and cubic. I understand this could be the key to solving the issue, but why is it not problematic when including linear and quadratic then? – Ari Olivelli Jan 07 '19 at 04:45
  • I don't really think you need to "solve" the problem; what's wrong with the orthogonal polynomial? The only difficulty would be if you actually wanted to be able to interpret the numerical values of the linear/quadratic/cubic terms in the model. In this case, I would suggest mean-centering the `LowerBSize` variable and using raw polynomials - won't be quite as stable as ortho polynomials, but will probably solve the problem, and would be interpretable (and easier to back-transform to the original scale if you did want). – Ben Bolker Jan 07 '19 at 22:30