Date is within an interval and pull field

Question

I have two df's: maindf and list.

ID <- c(1, 1, 1, 1, 5, 5)
SURVEY_DATE <- c("1997-08-01", "1998-08-20", "1998-11-20", "2000-12-13", "1998-05-02", "1998-12-25")
SURVEY_DATE <- as.Date(SURVEY_DATE)
maindf <- data.frame(ID, SURVEY_DATE)
maindf

ID <- c(1, 1, 1, 1, 5, 5)
ASSIGN_DATE <- c(1997, 1998, 1999, 2000, 1997, 1998)
TIME1 <- c("1997-07-23", "1998-11-17", "1999-12-15", "2000-12-11", "1998-04-07", "1998-12-06")
  TIME1 <- as.Date(TIME1)
TIME2 <- c("1998-11-17", "1999-12-15", "2000-12-11", "2001-12-30", "1998-12-06", "1999-11-28")
  TIME2 <- as.Date(TIME2)
list <- data.frame(ID, ASSIGN_DATE, TIME1, TIME2)
list

The maindf has a SURVEY_DATE field. This field needs to check in the list to see if it falls within TIME1 and TIME2 by ID. If it does, I would like to pull the ASSIGN_DATE into the maindf.

The final product should look like:

ID SURVEY_DATE     ASSIGN_DATE
1  1  1997-08-01     1997
2  1  1998-08-20     1997
3  1  1998-11-20     1998
4  1  2000-12-13     2000
5  5  1998-05-02     1997
6  5  1998-12-25     1998

I know this is very similar to this post and this post, but I'm having some trouble with pulling a field over by ID.

score 2 · Accepted Answer · answered Aug 11 '18 at 07:22

The OP has requested "to pull the ASSIGN_DATE into the maindf".

This can be achieved by an update join which modifies maindf by reference:

library(data.table)
setDT(maindf)[setDT(list), on = .(ID, SURVEY_DATE >= TIME1, SURVEY_DATE <= TIME2), 
       ASSIGN_DATE := i.ASSIGN_DATE][]

   ID SURVEY_DATE ASSIGN_DATE
1:  1  1997-08-01        1997
2:  1  1998-08-20        1997
3:  1  1998-11-20        1998
4:  1  2000-12-13        2000
5:  5  1998-05-02        1997
6:  5  1998-12-25        1998

DanY · Answer 2 · 2018-08-11T02:01:50.720

I lack the ingenuity to come up with anything more creative that a for loop right now, but at least this will get the job done:

# recreate data (because I like lowercase)
maindf <- data.frame(
    id = c(1, 1, 1, 1, 5, 5), 
    sdate = as.Date(c("1997-08-01", "1998-08-20", "1998-11-20", "2000-12-13", "1998-05-02", "1998-12-25")))

otherdf <- data.frame(
    id = c(1, 1, 1, 1, 5, 5),
    adate = c(1997, 1998, 1999, 2000, 1997, 1998),
    time1 = as.Date(c("1997-07-23", "1998-11-17", "1999-12-15", "2000-12-11", "1998-04-07", "1998-12-06")),
    time2 = as.Date(c("1998-11-17", "1999-12-15", "2000-12-11", "2001-12-30", "1998-12-06", "1999-11-28"))
)

# my sad loop
maindf$adate <- NA
for(i in 1:nrow(maindf)) {
    c1 <- otherdf$id    == maindf[i, "id"]
    c2 <- otherdf$time1 <  maindf[i, "sdate"]
    c3 <- otherdf$time2 >  maindf[i, "sdate"]
    maindf[i, "adate"] <- otherdf[c1 & c2 & c3, "adate"]
}

Maurits Evers · Answer 3 · 2018-08-11T02:34:57.533

Option 1: The data.table way

Using data.table::foverlaps

library(data.table)
setDT(maindf)[, `:=`(TIME1 = SURVEY_DATE, TIME2 = SURVEY_DATE)]
setDT(list)

# Interval-merge by TIME1 and TIME2
setkey(list, ID, TIME1, TIME2)
dt <- foverlaps(maindf, list)

# Clean up to reproduce expected output
dt[, .SD, .SDcols = c(names(maindf)[1:2], "ASSIGN_DATE")]
#   ID SURVEY_DATE ASSIGN_DATE
#1:  1  1997-08-01        1997
#2:  1  1998-08-20        1997
#3:  1  1998-11-20        1998
#4:  1  2000-12-13        2000
#5:  5  1998-05-02        1997
#6:  5  1998-12-25        1998

Explanation: foverlaps performs an overlap-join, based on the time intervals from two data.tables; foverlaps requires a start and end time point in each data.table, so we choose TIME1 = SURVEY_DATE as the start and TIME2 = SURVEY_DATA as the end point for maindf. foverlaps needs to know the keys by which to merge (here ID, TIME1 and TIME2) for the second argument of foverlaps which we set with setkey.

Option 2: The tidyverse/fuzzyjoin way

Using fuzzyjoin::fuzzy_left_join

library(fuzzyjoin)
library(tidyverse)
maindf %>% mutate(SURVEY_DATE = as.Date(SURVEY_DATE)) %>%
    fuzzy_left_join(
        list %>% mutate_at(vars(starts_with("TIME")), as.Date),
        by = c("ID" = "ID", "SURVEY_DATE" = "TIME1", "SURVEY_DATE" = "TIME2"),
        match_fun = list(`==`, `>=`, `<=`)) %>%
    rename(ID = ID.x) %>%
    select(names(maindf), ASSIGN_DATE)
#  ID SURVEY_DATE ASSIGN_DATE
#1  1  1997-08-01        1997
#2  1  1998-08-20        1997
#3  1  1998-11-20        1998
#4  1  2000-12-13        2000
#5  5  1998-05-02        1997
#6  5  1998-12-25        1998

score 0 · Answer 4 · answered Aug 11 '18 at 02:23

data.table "non-equi join" for the win:

#re-create data as data.tables and with lowercase
library(data.table)
maindt <- data.table(
    id = c(1, 1, 1, 1, 5, 5), 
    sdate = as.Date(c("1997-08-01", "1998-08-20", "1998-11-20", "2000-12-13", "1998-05-02", "1998-12-25")))

otherdt <- data.table(
    id = c(1, 1, 1, 1, 5, 5),
    adate = c(1997, 1998, 1999, 2000, 1997, 1998),
    time1 = as.Date(c("1997-07-23", "1998-11-17", "1999-12-15", "2000-12-11", "1998-04-07", "1998-12-06")),
    time2 = as.Date(c("1998-11-17", "1999-12-15", "2000-12-11", "2001-12-30", "1998-12-06", "1999-11-28"))
)

#one-line merge
maindt[otherdt, on = .(id==id, cond1 = sdate > time1, cond3 = sdate < time2), .(id, sdate=x.sdate, adate), nomatch=0]

The non-equi join syntax is a nightmare in my opinion, but I've always struggled with the dt1[dt2] merge style, so what do I know...

Chris Holbrook · Answer 5 · 2018-08-11T03:19:49.983

A base R solution using a full outer join and a conditional subset...

#full outer join 
foj <- merge(maindf, list, all = TRUE, by = "ID")
#conditional subset
df2 <- subset(foj, SURVEY_DATE >= TIME1 & SURVEY_DATE <= TIME2)

# > df2[, c("ID", "SURVEY_DATE", "ASSIGN_DATE")]
#     ID SURVEY_DATE       ASSIGN_DATE
# 1   1  1997-08-01        1997
# 5   1  1998-08-20        1997
# 10  1  1998-11-20        1998
# 16  1  2000-12-13        2000
# 17  5  1998-05-02        1997
# 20  5  1998-12-25        1998

Date is within an interval and pull field

5 Answers5