Hello, I am learning about survival analysis, and I was curious whether I could use the survival package on survival data of this form:

[image: example of the survival data]

Here is some code to generate data in this form:

start_interval <-  seq(0, 13)
end_interval <-  seq(1, 14)
living_at_start <- round(seq(1000, 0, length.out = 14))
dead_in_interval <- c(abs(diff(living_at_start)), 0)
df <- data.frame(start_interval, end_interval, living_at_start, dead_in_interval)

From my use of the survival package so far, it seems to expect one row per individual, each with its own survival time, but I might be misreading the documentation of the Surv function. If survival will not work, what other packages are out there for this type of data? If there is no package or function that easily estimates the survival function, I can calculate the survival probabilities myself with the following equation.

[image: equation for the survival function]
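For reference, here is a minimal sketch of one way to do this calculation by hand in base R. It assumes no censoring (true of the generated data, where everyone eventually dies) and may not be the exact equation pictured. The names `n0`, `surv_at_start`, `cond_surv`, and `surv_at_end` are illustrative only.

```r
# Same simulated life table as above.
start_interval   <- seq(0, 13)
end_interval     <- seq(1, 14)
living_at_start  <- round(seq(1000, 0, length.out = 14))
dead_in_interval <- c(abs(diff(living_at_start)), 0)
df <- data.frame(start_interval, end_interval, living_at_start, dead_in_interval)

# Assumption: no censoring, so everyone who leaves the risk set has died.
n0 <- df$living_at_start[1]

# Proportion of the original cohort still alive at the start of each interval.
surv_at_start <- df$living_at_start / n0

# Equivalent product-limit form: multiply the conditional survival
# probabilities (1 - deaths / at risk) over the intervals with anyone at risk.
at_risk     <- df$living_at_start
has_risk    <- at_risk > 0
cond_surv   <- 1 - df$dead_in_interval[has_risk] / at_risk[has_risk]
surv_at_end <- cumprod(cond_surv)

data.frame(time = df$start_interval, surv = surv_at_start)
```

Without censoring the two forms agree: the survival at the end of one interval is the proportion alive at the start of the next.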

  • Please do not post photos of data or code! If you do, people who are willing to help you would have to type out all that text. Instead, provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). P.S. Here is [a good overview on how to ask a good question](https://stackoverflow.com/help/how-to-ask). – dario Oct 05 '21 at 13:54
  • For survival analyses using the survival package, you should have one observation per patient, which is standard in the field. – csgroen Oct 05 '21 at 14:03
  • @dario I added some code to generate data in this form. – Victor Feagins Oct 05 '21 at 14:19
  • @csgroen Would I need to transform the data, replicating each survival interval by its number of deaths, if I wanted to use the survival package? – Victor Feagins Oct 05 '21 at 14:21
  • I think so... the package is not built with the data in the format you have in mind. However, you can probably manually plot a Kaplan-Meier curve with what you have. – csgroen Oct 05 '21 at 15:31
  • @csgroen I did the duplication and answered the question with this work. It gets the same answer as the manual calculation, but there are a few things I am uncertain of. – Victor Feagins Oct 05 '21 at 15:42

1 Answer

Since the survival package needs one observation per survival time, the data must be transformed first. The steps below use the simulated data from the question.

Simulated Data:

library(survival)
start_interval <-  seq(0, 13)
end_interval <-  seq(1, 14)
living_at_start <- round(seq(1000, 0, length.out = 14))
dead_in_interval <- c(abs(diff(living_at_start)), 0)
df <- data.frame(start_interval, end_interval, living_at_start, dead_in_interval)

Transforming the data by duplicating each row by the number of deaths in its interval

duptimes <- df$dead_in_interval                  # deaths per interval
rid <- rep(seq_len(nrow(df)), times = duptimes)  # each row index repeated once per death
df.t <- df[rid, ]                                # one row per death
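A quick sanity check on the expansion, as a sketch in base R: the expanded table should have one row per death, and the number of rows per interval should equal the original death counts. The names `expanded`, `n_rows_ok`, `per_interval`, and `counts_ok` are illustrative and not part of the code above.

```r
# Same simulated life table as above.
start_interval   <- seq(0, 13)
end_interval     <- seq(1, 14)
living_at_start  <- round(seq(1000, 0, length.out = 14))
dead_in_interval <- c(abs(diff(living_at_start)), 0)
df <- data.frame(start_interval, end_interval, living_at_start, dead_in_interval)

# Repeat each row once per death in its interval.
expanded <- df[rep(seq_len(nrow(df)), times = df$dead_in_interval), ]

# One row per death overall.
n_rows_ok <- nrow(expanded) == sum(df$dead_in_interval)

# Rows per interval, keeping intervals with zero deaths as a count of 0.
per_interval <- as.vector(table(factor(expanded$start_interval,
                                       levels = df$start_interval)))
counts_ok <- identical(per_interval, as.integer(df$dead_in_interval))

c(n_rows_ok = n_rows_ok, counts_ok = counts_ok)
```

Intervals with zero deaths, such as the last one here, contribute no rows to the expanded table.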

Using the Surv Function

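# Per ?Surv, with type = "interval" an event code of 1 means an event
# observed at `time`; code 3 would mark an interval-censored observation.
test <- Surv(time = df.t$start_interval,
             time2 = df.t$end_interval,
             event = rep(1, nrow(df.t)),  # every observation is a death
             type = "interval")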

Fitting the survival curve

summary(survfit(test ~ 1))

[image: output of summary(survfit(test ~ 1))]

Comparing with the by-hand calculation from the original data

df$living_at_start/max(df$living_at_start)

[image: output of the by-hand calculation]

They match.
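As a possible alternative to duplicating rows, survfit() accepts a weights argument, so the death counts can be passed as case weights. This is only a sketch under assumptions that are mine rather than anything the data dictates: each death is recorded at the end of its interval, the zero-count row is dropped so all weights are positive, and the names `d`, `status`, `fit_w`, and `by_hand` are illustrative.

```r
library(survival)

# Same simulated life table as above.
start_interval   <- seq(0, 13)
end_interval     <- seq(1, 14)
living_at_start  <- round(seq(1000, 0, length.out = 14))
dead_in_interval <- c(abs(diff(living_at_start)), 0)
df <- data.frame(start_interval, end_interval, living_at_start, dead_in_interval)

# Drop intervals with no deaths so every case weight is positive.
d <- df[df$dead_in_interval > 0, ]
d$status <- 1  # every counted subject died (no censoring)

# Assumption: each death is recorded at the end of its interval.
fit_w <- survfit(Surv(end_interval, status) ~ 1, data = d,
                 weights = dead_in_interval)

# By-hand proportion of the cohort alive at the end of each of those intervals.
by_hand <- df$living_at_start[-1] / df$living_at_start[1]

all.equal(as.numeric(fit_w$surv), by_hand)
```

Because the deaths are placed at the interval ends here, the fitted curve steps down at times 1 through 13 rather than starting at time 0, so its time axis is shifted relative to the Surv call above.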

Questions

When using the survfit function, why is the number at risk 1001 at time 0 when there are only 1000 people in the data?

length(test)

[image: output of length(test)]

  • Regarding the different number at risk, I asked the question on Cross Validated: [link](https://stats.stackexchange.com/questions/547141/survival-package-why-does-interval-censoring-have-a-different-number-at-risk) – Victor Feagins Oct 05 '21 at 16:34