4

How can I plot survival curves for representative values of a continuous covariate in a cox proportional hazards model? Specifically, I would like to do this in ggplot using a "survfit.cox" "survfit" object.

This may seem like a question that has already been answered, but I have searched through everything in SO with the terms 'survfit' and 'newdata' (plus many other search terms). This is the thread that comes closest to answering my question so far: Plot Kaplan-Meier for Cox regression

In keeping with the reproducible example offered in one of the answers to that post:

url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
df <- read.table(url, header = TRUE)

library(dplyr)
library(ggplot2)
library(survival)
library(magrittr)
library(broom)

# Identifying the 25th and 75th percentiles for prio (continuous covariate)

summary(df$prio)

# Cox proportional hazards model with other covariates
# 'prio' is our explanatory variable of interest

m1 <- coxph(Surv(week, arrest) ~ 
                       fin + age + race + prio,
                     data = df)

# Creating new df to get survival predictions
# Want separate curves for the the different 'fin' and 'race'
# groups as well as the 25th and 75th percentile of prio

newdf <- df %$%
  expand.grid(fin = levels(fin), 
                    age = 30, 
                    race = levels(race), 
                    prio = c(1,4))

# Obtain the fitted survival curve, then tidy 
# into a dataframe that can be used in ggplot

survcurv <- survfit(m1, newdata = newdf) %>%
  tidy()

The problem is, that once I have this dataframe called survcurv, I cannot tell which of the 'estimate' variables belongs to which pattern because none of the original variables are retained. For example, which of the 'estimate' variables represents the fitted curve for 30 year old, race = 'other', prio = '4', fin = 'no'?

In all other examples i've seen, usually one puts the survfit object into a generic plot() function and does not add a legend. I want to use ggplot and add a legend for each of the predicted curves.

In my own dataset, the model is a lot more complex and there are a lot more curves than I show here, so as you can imagine seeing 40 different 'estimate.1'..'estimate.40' variables makes it hard to understand what is what.

Community
  • 1
  • 1
RNB
  • 457
  • 1
  • 5
  • 14
  • Don't add your "answer" to your question. If you have a different solution, post your own answer. It's up to the community to vote on which answer they thing is best for the future, and you decide which answer you ultimately accept. – MrFlick Oct 05 '16 at 13:47

2 Answers2

4

Thanks for providing a well phrased question and a good example. I'm a little surpirsed that tidy does a relatively poor job here of creating sensible output. Please see below for my attempt at creating some plottable data:

library(tidyr)
newdf$group <- as.character(1:nrow(newdf))

survcurv <- survfit(m1, newdata = newdf) %>%
  tidy() %>% 
  gather('key', 'value', -time, -n.risk, -n.event, -n.censor) %>% 
  mutate(group = substr(key, nchar(key), nchar(key)),
         key   = substr(key, 1, nchar(key) - 2)) %>% 
  left_join(newdf, 'group') %>% 
  spread(key, value)

And the create a plot (perhaps you'd like to use geom_step instead, but there is not step shaped ribbon, unfortunately):

ggplot(survcurv, aes(x = time, y = estimate, ymin = conf.low, ymax = conf.high,
                     col = race, fill = race)) +
  geom_line(size = 1) +
  geom_ribbon(alpha = 0.2, col = NA) +
  facet_grid(prio ~ fin)

enter image description here

Axeman
  • 32,068
  • 8
  • 81
  • 94
  • This was exactly what I was looking for. I made a slight tweak to your code with my own dataset because `mutate(group = substr(key, nchar(key), nchar(key)), key = substr(key, 1, nchar(key) - 2)) ` won't work if you have more than 10 patterns or survival curves. I'll edit my original post to show what I did. Thanks again! – RNB Oct 05 '16 at 12:55
  • @RNB Ah yes, I was looking at `separate`, but couldn't get it to do what I wanted quickly enough. – Axeman Oct 05 '16 at 13:09
  • 1
    Yeah, I only thought of it because I had to fiddle with it a few days ago to get nice tidy output from the Effects package, so it was fresh in my mind. I'm also surprised that tidy() isn't more tidy with survfit objects. I've been using the broom package a lot lately and it usually works wonderfully. – RNB Oct 05 '16 at 13:12
  • Is there a way of doing this with continuous rather than categorical variables? If you try it where race is instead a value - eg 20 and 25 it throws up the error "Error in f(...) : Aesthetics can not vary with a ribbon" when plotting – redferry Mar 31 '20 at 15:00
  • @redferry, it's probably better if you ask a new question about that. – Axeman Mar 31 '20 at 16:25
3

Try defining your survcurv like this:

survcurv <- 
  lapply(1:nrow(newdf),
         function(x, m1, newdata){
           cbind(newdata[x, ], survfit(m1, newdata[x, ]) %>% tidy)
         },
         m1, 
         newdf) %>%
  bind_rows()

This will include all of the predictor values as columns with the predicted estimates.

Benjamin
  • 16,897
  • 6
  • 45
  • 65
  • I also accepted your answer as well as Axeman's. Unlike Axeman's answer, I did not need to do any tweaking to get it to work with more than 10 rows in newdf. However, I slightly preferred axeman's answer because the code is easier to read than functions within loops. And it is more in 'my style' of writing code. But thanks, this also works! – RNB Oct 05 '16 at 13:10