I have two data frames ev1 and ev2, describing timestamps of two types of events collected over many tests. So, each data frame has columns "test_id", and "timestamp". What I need to find is the minimum distance of ev1 for each ev2, in the same test.
I have a working code that merges the two datasets, calculates the distances, and then uses dplyr to filter for the minimum distance:
ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(6, 1, 8, 4, 5, 11))
data <- merge(ev2, ev1, by=c("test_id"), suffixes=c(".ev2", ".ev1"))
data$distance <- data$time.ev2 - data$time.ev1
min_data <- data %>%
group_by(test_id, time.ev2) %>%
filter(abs(distance) == min(abs(distance)))
While this works, the merge part is very slow and feels inefficient -- I'm generating a huge table with all combinations of ev2->ev1 for the same test_id, only to filter it down to one. It seems like there should be a way to "filter on the fly", during the merge. Is there?
Update: The following case with two "group by" columns fails when data.table approach outlined by akrun is used:
ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4), group_id=c(0, 0, 0, 1, 1, 1))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(5, 6, 7, 1, 2, 8), group_id=c(0, 0, 0, 1, 1, 1))
setkey(setDT(ev1), test_id, group_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=abs(time-i.time)]
Error in eval(expr, envir, enclos) : object 'i.time' not found