R: carrying values forward by an ID and only through a given date range

Question

I have a data table that looks like the following:

   id  firstd       lastd       treat
   1   2003-03-23   2003-03-25  1
   1   2003-03-24   2003-03-25  NA
   1   2003-03-25   2003-03-25  NA
   1   2003-05-13   2003-05-15  0
   1   2003-05-14   2003-05-15  NA
   1   2003-05-15   2003-05-15  NA
   2   2004-04-28   2004-04-30  0
   2   2004-04-29   2003-04-30  NA
   2   2004-04-30   2003-04-30  NA

I want to carry through the values of the column treat through the date range from firstd to firstd==lastd by id so the NAs are filled in with the given values.

Ideally, it will look like the following:

   id  firstd       lastd       treat
   1   2003-03-23   2003-03-25  1
   1   2003-03-24   2003-03-25  1
   1   2003-03-25   2003-03-25  1
   1   2003-05-13   2003-05-15  0
   1   2003-05-14   2003-05-15  0
   1   2003-05-15   2003-05-15  0
   2   2004-04-28   2004-04-30  0
   2   2004-04-29   2003-04-30  0
   2   2004-04-30   2003-04-30  0

I know how to carry a value through by one column, but have not done it with the added complexity of a given date range. Does anyone know how to do this?

The code I generally use when carrying forward values through given columns is the following-

    one[, treat:= treat[!is.na(treat)][1], by = id]

Does anyone know how to amend this piece of code to also consider the given date ranges? Or would have any further suggestions?

is there a typo in the last 2 lastd of your dataset? 2004 instead of 2003? — chinsoon12, Feb 11 '20 at 01:00

akrun · Answer 1 · 2020-02-11T00:48:45.710

We can group by 'id' and fill

library(dplyr)
library(tidyr)
one %>%
   group_by(id) %>%
   fill(treat)
# A tibble: 9 x 4
# Groups:   id [2]
#     id firstd     lastd      treat
#  <int> <chr>      <chr>      <int>
#1     1 2003-03-23 2003-03-25     1
#2     1 2003-03-24 2003-03-25     1
#3     1 2003-03-25 2003-03-25     1
#4     1 2003-05-13 2003-05-15     0
#5     1 2003-05-14 2003-05-15     0
#6     1 2003-05-15 2003-05-15     0
#7     2 2004-04-28 2004-04-30     0
#8     2 2004-04-29 2003-04-30     0
#9     2 2004-04-30 2003-04-30     0

If we are also using dates as grouping variable, then

one %>%
   group_by(id, grp = rleid(lastd)) %>%
   fill(treat)

Or if we also consider the 'firstd', then create a grouping variable based on the equality between the dates

one %>%
    group_by(id, grp = lag(cumsum(firstd == lastd), default = 0)) %>%
    fill(treat)

With data.table we can combine with na.locf0 from zoo

library(zoo)
library(data.table)
setDT(one)[, treat := na.locf0(treat), by = id][]
#   id     firstd      lastd treat
#1:  1 2003-03-23 2003-03-25     1
#2:  1 2003-03-24 2003-03-25     1
#3:  1 2003-03-25 2003-03-25     1
#4:  1 2003-05-13 2003-05-15     0
#5:  1 2003-05-14 2003-05-15     0
#6:  1 2003-05-15 2003-05-15     0
#7:  2 2004-04-28 2004-04-30     0
#8:  2 2004-04-29 2003-04-30     0
#9:  2 2004-04-30 2003-04-30     0

data

one <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
 firstd = c("2003-03-23", 
"2003-03-24", "2003-03-25", "2003-05-13", "2003-05-14", "2003-05-15", 
"2004-04-28", "2004-04-29", "2004-04-30"), lastd = c("2003-03-25", 
"2003-03-25", "2003-03-25", "2003-05-15", "2003-05-15", "2003-05-15", 
"2004-04-30", "2003-04-30", "2003-04-30"), treat = c(1L, NA, 
NA, 0L, NA, NA, 0L, NA, NA)), class = "data.frame", row.names = c(NA, 
-9L))

Thanks for all your help!! does this take the different date ranges into account? You'll notice that the same ID can have different date ranges within the data table. — bziggy, Feb 11 '20 at 00:48
@boodaloo1 My update with `rleid` would take into consideration based on the 'lastd' — akrun, Feb 11 '20 at 00:49

score 1 · Answer 2 · answered Feb 11 '20 at 01:01

Another data.table option using non-equi join and then update by reference:

DT[is.na(treat), treat := 
    DT[!is.na(treat)][.SD, on=.(id, firstd<=lastd, lastd>=lastd), x.treat]
    ]

output:

   id     firstd      lastd treat
1:  1 2003-03-23 2003-03-25     1
2:  1 2003-03-24 2003-03-25     1
3:  1 2003-03-25 2003-03-25     1
4:  1 2003-05-13 2003-05-15     0
5:  1 2003-05-14 2003-05-15     0
6:  1 2003-05-15 2003-05-15     0
7:  2 2004-04-28 2004-04-30     0
8:  2 2004-04-29 2004-04-30     0
9:  2 2004-04-30 2004-04-30     0

data:

library(data.table)
DT <- fread("id  firstd       lastd       treat
1   2003-03-23   2003-03-25  1
1   2003-03-24   2003-03-25  NA
1   2003-03-25   2003-03-25  NA
1   2003-05-13   2003-05-15  0
1   2003-05-14   2003-05-15  NA
1   2003-05-15   2003-05-15  NA
2   2004-04-28   2004-04-30  0
2   2004-04-29   2004-04-30  NA
2   2004-04-30   2004-04-30  NA")
cols <- c("firstd", "lastd")
DT[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]

R: carrying values forward by an ID and only through a given date range

2 Answers2

data