R: calculating time-at-risk by age

Question

I have a cohort study in which participants have a date-of-birth and dates when they entered and exited the study.

I am trying to calculate the time-at-risk (i.e. duration in the study) by age, sex and year.

For example, a participant who enters the study in June 2005 age 40.8 (in decimal) and remains in the study for one year would contribute 0.2 years in 2005 age 40, 0.3 years in 2005 age 41, and 0.5 years in 2006 age 41.

The data looks like this:

N <- 1000
set.seed(50)
d <- data.frame(
  sex = sample(c('m', 'f'), N, replace = T, prob = c(0.7, 0.3)),
  dob = sample(seq(as.Date('1960/01/01'), as.Date('1985/01/01'), by="day"), N, replace = T),
  study_entry = sample(seq(as.Date('2000/01/01'), as.Date('2010/01/01'), by="day"), N, replace = T)
)
d$study_exit <- d$study_entry + runif(N, 10, 2000)
d$age_entry <- as.numeric(d$study_entry - d$dob) / 365.25

And I am trying to create a summary table of follow-up duration that looks like this:

+--------+------+-----+--------------+
| year   | sex  | age | time at risk |
+--------+------+-----+--------------+
| 2000   | male | 20  | ....         |
+--------+------+-----+--------------+
| 2000   | male | 21  | ....         |
+--------+------+-----+--------------+
| 2000   | male | 22  | ....         |
+--------+------+-----+--------------+
| etc... | ...  | ... | ...          |
+--------+------+-----+--------------+

How would you go about this?

Your sample data is pretty large. Since you did not provide a desired output, there is no way to check the answers below (including mine) by hand to see if they actually deliver the output you desire. Better to create a (much) smaller test-data-set with just a couple of rows, so verification of answers is easy (or just add the desired output into your question itself). — Wimpel, Oct 29 '19 at 08:35

Wimpel · Accepted Answer · 2019-10-29T08:21:55.183

Here is an approach usig data.table and lubridate.

What it basically does, if to perform an overlap-join of your data on a lookup-table that exist of daily periods for the span of the study.

Then, for each 'match', the actual age of the participant on that day is calculated, based on the date-of-birth, using lubridate::as.period().

When all this is known, summarising is pretty easy (although I did not know what you exactly wanted to summarise, so I just summarised the total sum of days each group was at risk during thet year).

library(data.table)
library(lubridate)
#set d as data.table
setDT(d)
#over which years spans the study
ymin <- min( lubridate::year( d$study_entry ) )
ymax <- max( lubridate::year( d$study_exit ) )
#create lookup table of all days in study

dt.lookup <- data.table( from = seq( as.Date( paste0( ymin, "-01-01" ) ), 
                                     as.Date( paste0( ymax, "-12-31" ) ), 
                                     by = "day") )
dt.lookup[, to := from ]
#set keys
setkey( dt.lookup, from, to )
setkey( d, study_entry, study_exit )
#use foverlaps to join both data-sets
dt <- foverlaps( d, dt.lookup )
#now, we can calculate the age based on `from` and dob. 
# for other options, see: https://stackoverflow.com/questions/27096485/change-a-column-from-birth-date-to-age-in-r
dt[, 
   actual_age := floor( 
     as.numeric( 
       lubridate::as.period( interval( dob, from ), unit = "years" ), 
       "years" ) ) ]
#since each row is the duration of 1 day (from-to), 
# we can summarise using that knowledge
dt[, .( days_at_risk = .N ), 
   by = .( year = lubridate::year( from),
           sex = sex, 
           age = actual_age ) ]

output

#      year sex age days_at_risk
#   1: 2000   m  26         1459
#   2: 2000   m  27          848
#   3: 2001   m  27         2678
#   4: 2001   m  28         1752
#   5: 2000   f  24          248
# ---                          
# 723: 2013   m  44          242
# 724: 2013   m  40          146
# 725: 2014   m  41           82
# 726: 2015   m  42            8
# 727: 2015   m  43           58

updated code to use `lubridate` to calculate actual_age.. (which is about ....x times faster then `age_calc` from `eeptools`-package (which i previously used in this answer). — Wimpel, Oct 29 '19 at 08:18
This is brilliant! I would never have thought to use the overlaps join. Thank you very much. — Dan Lewer, Oct 29 '19 at 09:42
@DanLewer It's always cool to see functions you already knew of, being used in a way you hadn't thought of yet :) — Wimpel, Oct 29 '19 at 09:49

score 2 · Answer 2 · answered Oct 29 '19 at 08:29

Another general framework would be to use Lexis tools from the Epi package to split up each observation according to both the calendar time and age time scales:

library(tidyverse)

# Define time scales
lex <- Epi::Lexis(
  entry = list(year = study_entry),
  exit  = list(year = study_exit, age = study_exit - dob),
  data = mutate_if(d, lubridate::is.Date, Epi::cal.yr)
)

# Split observations into follow-up periods
lex_split <- lex %>% 
  Epi::splitLexis(time.scale = "age",  breaks = 15:50) %>% 
  Epi::splitLexis(time.scale = "year", breaks = 2000 + 0:10)

lex_split %>% 
  mutate(
    age = Epi::timeBand(., "age", type = "left"),
    year = Epi::timeBand(., "year", type = "left")
  ) %>% 
  group_by(year, sex, age) %>% 
  summarise(years_at_risk = sum(lex.dur))
#> # A tibble: 558 x 4
#> # Groups:   year, sex [22]
#>     year sex     age years_at_risk
#>    <dbl> <fct> <dbl>         <dbl>
#>  1  2000 f        15         0.523
#>  2  2000 f        16         0.947
#>  3  2000 f        17         0.152
#>  4  2000 f        18         0.286
#>  5  2000 f        19         1.10 
#>  6  2000 f        20         0.521
#>  7  2000 f        21         1.01 
#>  8  2000 f        22         0.396
#>  9  2000 f        23         0.344
#> 10  2000 f        24         0.677
#> # ... with 548 more rows

^{Created on 2019-10-29 by the reprex package (v0.3.0)}

Thank you - I was completely unaware of the Lexis tools. – Dan Lewer Oct 29 '19 at 10:49 — Dan Lewer, Oct 29 '19 at 10:49

score 0 · Answer 3 · answered Mar 14 '20 at 23:42

I've been using @Wimpel's solution, which works brilliantly - and then also realised that it can be slightly simplified by building a table of days in the study cohort directly:

library(data.table)
library(lubridate)
setDT(d)

d[, id := .I] # patient ID
dt <- d[, .(day = seq(from = study_entry, to = study_exit, by = 'day')), id]
dt <- d[, c('id', 'dob', 'sex')][dt, on = 'id']
dt[, actual_age := floor(as.numeric(day - dob) / 365.25)]
dt[, year := lubridate::year(day)]
dt[, .(days_at_risk = .N), c('actual_age', 'sex', 'year')]

R: calculating time-at-risk by age

3 Answers3