Estimating density from interval [start, stop] data in R

Question

Description

The motivation for this question is from clinical/epidemiological research, wherein studies often enroll patients and then follow them for variable lengths of time.

The distribution of age at study entry is often of interest and is easily assessed, however there is occasional interest in the distribution of age at any time during the study.

My question is, is there a method for estimating such a density from interval data such as [age_start, age_stop] without expansion of the data as below? The long-format method seems inelegant, to say nothing of its memory usage!

Reproducible example using data from the survival package

#### Prep Data ###
library(survival)
library(ggplot2)
library(dplyr)

data(colon, package = 'survival')
# example using the colon dataset from the survival package
ccdeath <- colon %>%
  # use data on time to death (not recurrence)
  filter(etype == 2) %>%
  # age at end of follow-up (death or censoring)
  mutate(age_last = age + (time / 365.25))

#### Distribution Using Single Value ####
# age at study entry
ggplot(ccdeath, aes(x = age)) +
  geom_density() +
  labs(title = "Fig 1.",
       x = "Age at Entry (years)",
       y = "Density")

#### Using Person-Month Level Data ####
# create counting-process/person-time dataset
ccdeath_cp <- survSplit(Surv(age, age_last, status) ~ ., 
                        data = ccdeath,
                        cut = seq(from = floor(min(ccdeath$age)),
                                  to = ceiling(max(ccdeath$age_last)),
                                  by = 1/12))

nrow(ccdeath_cp) # over 50,000 rows

# distribution of age at person-month level
ggplot(ccdeath_cp, aes(x = age)) +
  geom_density() +
  labs(title = "Figure 2: Density based on approximate person-months",
       x = "Age (years)",
       y = "Density")

#### Using Person-Day Level Data ####
# create counting-process/person-time dataset
ccdeath_cp <- survSplit(Surv(age, age_last, status) ~ ., 
                        data = ccdeath,
                        cut = seq(from = floor(min(ccdeath$age)),
                                  to = ceiling(max(ccdeath$age_last)),
                                  by = 1/365.25))

nrow(ccdeath_cp) # over 1.5 million rows!

# distribution of age at person-month level
ggplot(ccdeath_cp, aes(x = age)) +
  geom_density() +
  labs(title = "Figure 3: Density based on person-days",
       x = "Age (years)",
       y = "Density")

Note: while I tagged this question with "survival" because I thought it would attract people familiar with this area, I am not interested in time-to-event here, just the overall age distribution of all time under study.

While this is a well formulated question, it lacks [reproducibility](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) such as the dataset used or an example dataset. — Oliver, Sep 04 '20 at 19:21
Also "**distribution of age at any time during the study**", this does not seem to be what you're analysing at the moment. What you are currently looking at is the distribution of age at a more and more granuar level. eg. you are reducing between-distance in `cut` and not (as your question implies) the age in different intervals. — Oliver, Sep 04 '20 at 19:23
@Oliver -- were you not able to reproduce it? The colon dataset should be available to you when you load the survival library. — user1231088, Sep 04 '20 at 19:44
Ah, my apologies, this was not clear from the code example. Adding `data(colon)` at the start of the example will make this clear to every user (including me). :-) — Oliver, Sep 04 '20 at 19:53
The problem with using processed data for this example is that the dates used to generate the data are missing. Because enrollment is staggered, and the ages are not all gathered at the same calendar time, you cannot accurately determine the age at any given "real" time point. You can only assess what the distribution of ages were at the "synthetic" time points referenced to a mythical time=0. — IRTFM, Sep 16 '20 at 19:42

score 0 · Answer 1 · answered Sep 04 '20 at 21:26

Rather than calculate for finer and finer time intervals you can just keep a cumulative count of the number of patients at a particular age

setDT(ccdeath)
x <- rbind(
  ccdeath[, .(age = age, num_patients = 1)],
  ccdeath[, .(age = age_last, num_patients = -1)]
)[, .(num_patients = sum(num_patients)), keyby = age]

cccdeath <- x[x[, .(age = unique(age))], on = 'age']
cccdeath[, num_patients := cumsum(num_patients)]
ggplot(cccdeath, aes(x = age, y = num_patients)) + geom_step()

The sawtooth pattern is because every patient is assumed to start at an integer age. Had some thoughts about how you'd smooth this and came up with this idea - assign equal probabilites to a set of evenly spaced ages between the given age and age+1. You get something like this,

smooth_param <- 12
x <- rbindlist(lapply(
  (1:smooth_param-0.5)/smooth_param,
  function(s) {
    rbind(
      ccdeath[, .(age = age+s, num_patients = 1/smooth_param)],
      ccdeath[, .(age = age_last+s, num_patients = -1/smooth_param)]
    )
  }
))[, .(num_patients = sum(num_patients)), keyby = age]

cccdeath <- x[x[, .(age = sort(unique(age)))], on = 'age']
cccdeath[, num_patients := cumsum(num_patients)]
ggplot(cccdeath, aes(x = age, y = num_patients)) + geom_step()

score 0 · Answer 2 · answered Sep 04 '20 at 22:00

I would proceed along these lines:

If you are interested in knowing the age distribution after t days in the study, the age will simply be the age at enrollment plus t days. The exceptions that you need to handle those who have died or have been right-censored. In your example, you seem to have kept people's age "frozen" at the time they left the study. Personally I think the age distribution of survivors who have not been censored is more useful in a survival analysis, but I will stick to your set-up for this example.

The two possibilities for each patient at time t then are to have age at enrollment plus t if t is less than follow-up time. Otherwise the age will be the age at enrollment plus the follow-up time.

If you wrap this in a function, you can see how the age distribution changes throughout the study. For completeness we will always plot a faint density of age at enrollment, and a line indicating the current mean age:

age_distribution <- function(df, t = 0)
{
  df %>% 
    mutate(age_at_t = age + ifelse(time > t, t, time) / 365.25) %>%
    ggplot() +
    geom_density(aes(age), linetype = 2, colour = "gray50") +
    geom_density(aes(age_at_t)) +
    geom_vline(aes(xintercept = mean(age_at_t)), color = "red", linetype = 2) +
    labs(x = paste("Age at day", t, "of study"),
         y = "Density",
         title = paste("Age distribution after", t, "days in study"))
}

So, for example:

age_distribution(ccdeath, 0)

And after 1 year:

age_distribution(ccdeath, 365)

And after 5 years:

age_distribution(ccdeath, 5 * 365.25)

For completeness, the equivalent function with censored / dead patients removed would be like this:

age_distribution <- function(df, t = 0)
{
  df %>% 
    filter(time > t) %>%
    mutate(age_at_t = age + t / 365.25) %>%
    ggplot() +
    geom_density(data = df, aes(age), linetype = 2, colour = "gray50") +
    geom_density(aes(age_at_t)) +
    geom_vline(aes(xintercept = mean(age_at_t)), color = "red", linetype = 2) +
    labs(x = paste("Age at day", t, "of study"),
         y = "Density",
         title = paste("Age distribution after", t, "days in study"))
}

So we can see how the distribution's shape changes after 5 years of follow-up:

age_distribution(ccdeath, 5 * 365.25)

This shows more clearly that there is a disproportionate loss of older people from the initial cohort.

Estimating density from interval [start, stop] data in R

Description

Reproducible example using data from the survival package

2 Answers2