R: conditional join based on timestamp interval

Question

I have two data.frames:

# limits
   ID    Start_1      End_1    Start_2      End_2
1  A 2013-04-23 2013-06-09 2013-04-26 2017-02-06
2  B 2013-05-12 2013-08-08 2013-04-26 2017-02-06
3  C 2013-04-24 2013-04-26 2017-02-05 2017-02-08

and

# df (header shown)
   Timestamp          ID
1 2013-04-24 14:01:21  A
2 2013-04-24 14:01:46  B
3 2013-04-24 14:01:50  C
4 2013-04-25 00:02:19  A
5 2013-04-25 02:02:48  B
6 2013-04-25 04:02:04  C

I want to populate a column Pop in data.frame df based on the timestamp of its observations (df$Timestamp): if df$Timestamp is between two time limits (stored in data.frame, limits: limits$Start_1 and limits$End_1), the Pop column is populated with 'Yes', if not, 'No'.

If df$Timestamp is between the two next time limits (limits$Start_2 and limits$End_2), the Pop column is populated with 'Maybe', overwriting any 'Yes" or 'No'.

The set-up looks like this:

# main data.frame
df<-structure(list(Timestamp = structure(c(1366826481, 1366826506,
                                           1366826510, 1366862539, 1366869768, 1366876924, 1366948927, 1366948963,
                                           1367013725, 1367107304, 1367107308, 1367107316, 1486342833, 1486350011,
                                           1486350026, 1486429233, 1486436435, 1486436459, 1486515633, 1486522816,
                                           1486522834, 1486530052, 1486537217, 1486537251),
                                         class = c("POSIXct","POSIXt"), tzone = ""),
                   ID = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L,1L, 2L, 3L),
                                  .Label = c("A", "B", "C"), class = "factor")),
              .Names = c("Timestamp", "ID"), row.names = c(NA, -24L), class = "data.frame")  

# data.frame with time limits
limits<- structure(list(ID = structure(1:3, .Label = c("A", "B", "C"), class = "factor"),
                    Start_1 = structure(c(1366689600, 1368331200, 1366776000), class = c("POSIXct","POSIXt"), tzone = ""),
                    End_1 = structure(c(1370750400, 1375934400,1366948800), class = c("POSIXct", "POSIXt"), tzone = ""),
                    Start_2 = structure(c(1366948800, 1366948800, 1486270800), class = c("POSIXct", "POSIXt"), tzone = ""),
                    End_2 = structure(c(1486357200, 1486357200, 1486530000), class = c("POSIXct", "POSIXt"), tzone = "")),
               .Names = c("ID","Start_1", "End_1", "Start_2", "End_2"),
               row.names = c(NA,-3L), class = "data.frame")

I have two methods that seem to work but they are cumbersome and, for a real-life dataset (several thousands rows, hundreds of ID, and several $Start and $End periods for a same ID), it's hard to "trust" they work.

###### Method 1 ######
df1<-df
df1<-left_join(df1, limits, by="ID")

df1$Pop<-ifelse(df1$Timestamp>as.POSIXct(df1$Start_1) &
                 df1$Timestamp<as.POSIXct(df1$End_1), "Yes", "No")
df1$Pop<-ifelse(df1$Timestamp>as.POSIXct(df1$Start_2) &
                         df1$Timestamp<as.POSIXct(df1$End_2), "Maybe", df1$Pop)
df1$Pop<-as.factor(df1$Pop)
df1<-df1[,-c(3,6)]

###### Method 2 ######
df2<-df
df2<-df2[with(df2, order(ID, Timestamp)), ]
ids<-as.factor(levels(droplevels(df2$ID)))
tmp<-NULL
for(i in 1:length(ids)) {
  tmp[[i]]<-ifelse(df2$Timestamp[which(df2$ID==ids[i])]> as.POSIXct(limits$Start_1[i]) &
                     df2$Timestamp[which(df2$ID==ids[i])]< as.POSIXct(limits$End_1[i]), "Yes", "No") }
tmp<-data.frame(Pop = unlist(tmp)) # tmp is a list - this turns it into a data-frame
df2<-cbind(df2,tmp)

# add 'Maybe'
tmp2<-NULL
for(i in 1:length(ids)) {
  tmp2[[i]]<-df2$Timestamp[which(df2$ID==ids[i])]> as.POSIXct(limits$Start_2[i]) & 
    df2$Timestamp[which(df2$ID==ids[i])]< as.POSIXct(limits$End_2[i]) }
tmp2<-data.frame(Pop = unlist(tmp2))
df2$Pop<-as.character(df2$Pop)
df2$Pop[which(tmp2$Pop==TRUE)]<-'Maybe'
df2$Pop<-as.factor(df2$Pop)
df2<-df2[with(df2, order(Timestamp)), ]

Is there a more elegant way (using a function, package) to do this kind of join?

EDIT:
In Method 1's ifelse(), I was using limits$Start_1, limits$End_1, etc instead of df1$Start1, df1$End_1, etc.

[Related question](http://stackoverflow.com/questions/37289405/dplyr-left-join-by-less-than-greater-than-condition) — Jean, Mar 09 '17 at 01:39
Thanks. It seems like creating temporary columns in `df` with some kind of `left_join` based on `limits$Start_1` etc. , then deleting them after use is the common method. — YGS, Mar 09 '17 at 14:47

coomie · Accepted Answer · 2017-03-09T17:02:03.477

YGS, here is a data.table solution (using your set-up code above but with "as.data.table()" around the data.frames). Also assumed from your solutions that you did want "Maybe" to overwrite Yes/No answers.

library(data.table)

#Set keys on ID's for join
setkey(df, ID)
setkey(limits, ID)

#Join the data.tables on ID
DT <- df[limits]

#Create "pop" column and chain to desired columns from user output
DT <- DT[, ':=' (Pop = ifelse(Timestamp > Start_2 & Timestamp < End_2, "Maybe",
                         ifelse(Timestamp > Start_1 & Timestamp < End_1, "Yes","No")))][, c(1,2,5,6,7)]

UPDATE: a bit more elegant:

#Create "pop" column

DT <- df[limits, ':=' (Pop = ifelse(Timestamp > Start_2 & Timestamp < End_2, "Maybe",
                              ifelse(Timestamp > Start_1 & Timestamp < End_1, "Yes","No"))), by = .EACHI, on = "ID"]

Thanks Andrew. I understand that your solution is a `data.table` version (with a more concise conditional statement) of my Method 1 using `dplyr`. — YGS, Mar 09 '17 at 14:21

R: conditional join based on timestamp interval

1 Answers1

Linked