Conditional join on multiple date ranges

Question

I have two dataframes: 'Probes' and 'Events'. The below code will generate reproducible sample of these dataframes. Probes.subset is a dataframe of all observations from Probes that intersect the datetime ranges in Events. The code below will generate 6 events...in reality I have nearly 200 discreet events.

Aim: I need to join the Event.name from Events, to each corresponding observation in Probes.subset based on the Timestamp so that each observation is associated with the correct Event.

I have previously tried a FOR loop but this is incredibly slow and doesn't finish as my data is over 180,000 rows long. I have a feeling this can be solved by writing a function and using something like apply but I am a complete newbie to functions in R and none of the ones I've created work.

library(dplyr)

# Generate Probes data
start <- as.POSIXct("01/06/2016 01:00", format = "%d/%m/%Y %H:%M")
end <- start + as.difftime(1, units = "days")

Timestamp <- seq(from = start, to = end, by = "10 mins")
Value <- round(runif(145) * 100, 2)

Probes <- data.frame(Timestamp, Value)

# Generate Events data
Event.name <- seq(1, 6)

Event.start <- as.POSIXct(c("01/06/2016 01:20", "01/06/2016 05:00",
                            "01/06/2016 06:30", "01/06/2016 12:00",
                            "01/06/2016 17:40", "01/06/2016 19:20"),
                          format = "%d/%m/%Y %H:%M")

Event.end <- as.POSIXct(c("01/06/2016 02:00", "01/06/2016 05:30",
                            "01/06/2016 07:20", "01/06/2016 14:00",
                            "01/06/2016 18:10", "01/06/2016 21:40"),
                          format = "%d/%m/%Y %H:%M")

Events <- data.frame(Event.name, Event.start, Event.end)

# Subset probes data to fall within Events bounds
Probes.subset <- Probes %>%
  mutate(InRange = Timestamp %in% unlist(Map(
    `:`,
    Events$Event.start,
    Events$Event.end
  ))) %>%
  filter(InRange == "TRUE")

[A similar question](http://stackoverflow.com/questions/23095896/merging-two-dataframes-on-a-date-range-in-r) deals with joining over date ranges but their example has a common column between the dataframes. I cannot comment on their question as my rep is too low. — Ciaran, May 05 '17 at 14:02
Does this produce the result you want? `unlist(lapply(Probes.subset$Timestamp, function(x) which(x >= Events$Event.start & x <= Events$Event.end)))`. You can assign this to a column in `Probes.subset` — Gopala, May 05 '17 at 14:26
@Gopala - I tried running that but get the following error: `Error in `$<-.data.frame`(`*tmp*`, "Event", value = c(1L, 1L, 1L, 1L, : replacement has 43 rows, data has 54` — Ciaran, May 05 '17 at 14:35
Yeah - sorry. Instead of `unlist`, use `as.integer`. It will fill missing values properly. `as.integer(lapply(Probes.subset$Timestamp, function(x) which(x >= Events$Event.start & x <= Events$Event.end, arr.ind = TRUE)))`. — Gopala, May 05 '17 at 14:56

Mike H. · Accepted Answer · 2017-05-05T14:59:46.497

5

In SQL you can use as to specify an alias for your different datasets while joining. This allows you to choose a specific column from each dataset. Adopting that answer's code you could do:

library(sqldf)
res <- sqldf("SELECT l.*, r.`Event.name`
       FROM Probes as l
       LEFT JOIN Events as r
       ON  l.Timestamp BETWEEN r.`Event.start` AND r.`Event.end`")

head(res)
#            Timestamp Value Event.name
#1 2016-06-01 01:00:00 60.73         NA
#2 2016-06-01 01:10:00 14.01         NA
#3 2016-06-01 01:20:00 17.14          1
#4 2016-06-01 01:30:00 43.64          1
#5 2016-06-01 01:40:00 27.05          1
#6 2016-06-01 01:50:00 57.10          1

One note - I would be careful about your data set up since in row 5 of Events you have a start time > end time.

Also a fast data.table solution would be to use foverlaps:

library(data.table)
setDT(Probes)[,Time2 := Timestamp] #Clean data
setDT(Events)[, `:=`(start = min(Event.start, Event.end), end = max(Event.start, Event.end)), by = (seq_len(nrow(Events)))]

setkey(setDT(Events), start, end)
res2 <- foverlaps(setDT(Probes), Events, by.x = c("Timestamp", "Time2"), nomatch = NA)[, c("Time2", "Event.start", "Event.end", "start","end") := NULL]

edited May 05 '17 at 14:59

answered May 05 '17 at 14:25

Mike H.

13,960
2
29
39

Ah I found it after I posted. Giving it a try now! – Ciaran May 05 '17 at 14:42
That worked perfectly! Thank you so much for your help, this has been a headwreck for quite a while now. I'm going to get my head around using SQL within R now...still a newbie! +1 – Ciaran May 05 '17 at 14:45
1

No problem! I find SQL can be great for these types of joins. Also if you're concerned about speed I edited to include a fast `data.table` method. Although the syntax is not super friendly – Mike H. May 05 '17 at 14:46
I'm trying to run a speed test on the two methods you have suggested but the data.table method keeps saying unepeced ']' and I can't seem to fix the damn thing! Your first solution worked *super quick*, despite the fact it was working on over 180,000 rows! – Ciaran May 05 '17 at 14:57
1

Yea, I edited the last part for readability I'll change it so it runs properly...sorry about that – Mike H. May 05 '17 at 14:59
Oops that Row 5 issue was me not being vigilant with date assignment. The actual data doesn't suffer that issue. Will update Question now. – Ciaran May 05 '17 at 15:02

Conditional join on multiple date ranges

1 Answers1