I am working on trying to do an merge on timestamp and another variable, specifically looking at how many ambulette pickups fall within a chain, the idea being that these are multiple pickups by the same ambulette in a chain considered the same overall trip, while those that are sole trips stand alone.
I've used sqldf for this but foverlaps seems much faster and more scalable so I would like to utilize this package or one like it. I am able to merge the two to find how many project start times fall within the chain times but I am not getting a return on the nonmatches in y.
Below is the code to reproduce, in this case with only one ambulette doenoted by the ID. I am using this link as a reference: Data Table merge based on date ranges
#example---
trips = data.table(
"ambulette_id" = "1"
,"pickup" = as.POSIXct(c("2017-08-01 04:30:54",
"2017-08-01 04:50:54", "2017-08-01 05:25:54", "2017-08-01 05:35:54",
"2017-08-01 07:45:54", "2017-08-01 08:15:54", "2017-08-01 09:15:54",
"2017-08-01 09:15:54", "2017-08-01 10:00:54", "2017-08-01 11:40:54",
"2017-08-01 12:00:54", "2017-08-01 12:40:54"), tz = "GMT")
,"dropoff" = as.POSIXct(c("2017-08-01 05:00:59",
"2017-08-01 05:00:59", "2017-08-01 05:55:59", "2017-08-01 05:55:59",
"2017-08-01 08:35:59", "2017-08-01 08:35:59", "2017-08-01 09:30:59",
"2017-08-01 09:45:59", "2017-08-01 10:30:59", "2017-08-01 11:50:59",
"2017-08-01 12:15:59", "2017-08-01 13:05:59"), tz = "GMT")
)[,pickup2:=pickup]
chains = data.table(
"ambulette_id" = "1"
,"ambulette_chain_start" = as.POSIXct(c("2017-08-01 04:30:54",
"2017-08-01 05:25:54", "2017-08-01 07:45:54", "2017-08-01 09:15:54"
), tz = "GMT")
,"ambulette_chain_end" = as.POSIXct(c("2017-08-01 05:00:59", "2017-08-01 05:55:59",
"2017-08-01 08:35:59", "2017-08-01 09:45:59"),tz = "GMT")
)
#The final result is to merge trips on chains to get the pickups in trips that start in between the ranges in chains. Any pickups that don't match, should still show up as lone pickups, but instead foverlaps dumps them. Is there anyway to keep them?
setkey(trips,ambulette_id,pickup, pickup2)
final_join = foverlaps(chains
,trips
,by.x = c("ambulette_id", "ambulette_chain_start", "ambulette_chain_end"))[
,pickup2:=NULL]
#test shows some trips not showing up in the final join
trips[!(pickup %in% final_join$pickup)]
Below is the SQL version that gets me the result I want:
#sqldf version
library(sqldf)
z = setDT(sqldf("SELECT
trips.ambulette_id,
trips.pickup,
trips.dropoff,
chains.ambulette_chain_start,
chains.ambulette_chain_end
FROM trips LEFT JOIN chains
ON trips.ambulette_id = chains.ambulette_id AND
pickup BETWEEN ambulette_chain_start AND ambulette_chain_end"))[
,chained:=ifelse(is.na(ambulette_chain_start), "no", "yes")]
z
UPDATE:
The first response seems to answer almost all of the question, but I would like to keep the ambulette chain start and end columns from the merge so that the end product looks like the following. How would I do that?
ambulette_id pickup dropoff ambulette_chain_start ambulette_chain_end chained
1: 1 2017-08-01 00:30:54 2017-08-01 01:00:59 2017-08-01 00:30:54 2017-08-01 01:00:59 yes
2: 1 2017-08-01 00:50:54 2017-08-01 01:00:59 2017-08-01 00:30:54 2017-08-01 01:00:59 yes
3: 1 2017-08-01 01:25:54 2017-08-01 01:55:59 2017-08-01 01:25:54 2017-08-01 01:55:59 yes
4: 1 2017-08-01 01:35:54 2017-08-01 01:55:59 2017-08-01 01:25:54 2017-08-01 01:55:59 yes
5: 1 2017-08-01 03:45:54 2017-08-01 04:35:59 2017-08-01 03:45:54 2017-08-01 04:35:59 yes
6: 1 2017-08-01 04:15:54 2017-08-01 04:35:59 2017-08-01 03:45:54 2017-08-01 04:35:59 yes
7: 1 2017-08-01 05:15:54 2017-08-01 05:30:59 2017-08-01 05:15:54 2017-08-01 05:45:59 yes
8: 1 2017-08-01 05:15:54 2017-08-01 05:45:59 2017-08-01 05:15:54 2017-08-01 05:45:59 yes
9: 1 2017-08-01 06:00:54 2017-08-01 06:30:59 <NA> <NA> no
10: 1 2017-08-01 07:40:54 2017-08-01 07:50:59 <NA> <NA> no
11: 1 2017-08-01 08:00:54 2017-08-01 08:15:59 <NA> <NA> no
12: 1 2017-08-01 08:40:54 2017-08-01 09:05:59 <NA> <NA> no
FURTHER UPDATE: implementing as suggested: 1. omitted the No assignment 2. added the chain assignments
trips[
chains, on = .(ambulette_id, pickup > ambulette_chain_start,
pickup < ambulette_chain_end)
,':='(chained = 'yes'
, ambulette_chain_start = ambulette_chain_start
,ambulette_chain_end = ambulette_chain_end)]
ambulette_id pickup dropoff pickup2 chained ambulette_chain_start
1: 1 2017-08-01 04:30:54 2017-08-01 05:00:59 2017-08-01 04:30:54 NA <NA>
2: 1 2017-08-01 04:50:54 2017-08-01 05:00:59 2017-08-01 04:50:54 yes 2017-08-01 04:30:54
3: 1 2017-08-01 05:25:54 2017-08-01 05:55:59 2017-08-01 05:25:54 NA <NA>
4: 1 2017-08-01 05:35:54 2017-08-01 05:55:59 2017-08-01 05:35:54 yes 2017-08-01 05:25:54
5: 1 2017-08-01 07:45:54 2017-08-01 08:35:59 2017-08-01 07:45:54 NA <NA>
6: 1 2017-08-01 08:15:54 2017-08-01 08:35:59 2017-08-01 08:15:54 yes 2017-08-01 07:45:54
7: 1 2017-08-01 09:15:54 2017-08-01 09:30:59 2017-08-01 09:15:54 NA <NA>
8: 1 2017-08-01 09:15:54 2017-08-01 09:45:59 2017-08-01 09:15:54 NA <NA>
9: 1 2017-08-01 10:00:54 2017-08-01 10:30:59 2017-08-01 10:00:54 NA <NA>
10: 1 2017-08-01 11:40:54 2017-08-01 11:50:59 2017-08-01 11:40:54 NA <NA>
11: 1 2017-08-01 12:00:54 2017-08-01 12:15:59 2017-08-01 12:00:54 NA <NA>
12: 1 2017-08-01 12:40:54 2017-08-01 13:05:59 2017-08-01 12:40:54 NA <NA>
ambulette_chain_end
1: <NA>
2: 2017-08-01 05:00:59
3: <NA>
4: 2017-08-01 05:55:59
5: <NA>
6: 2017-08-01 08:35:59
7: <NA>
8: <NA>
9: <NA>
10: <NA>
11: <NA>
12: <NA>
I'm implementing incorrectly I assume, since the result is different than I am getting with sql or the previous solution