Find the minimum distance between two data frames, for each element in the second data frame

Question

I have two data frames ev1 and ev2, describing timestamps of two types of events collected over many tests. So, each data frame has columns "test_id", and "timestamp". What I need to find is the minimum distance of ev1 for each ev2, in the same test.

I have a working code that merges the two datasets, calculates the distances, and then uses dplyr to filter for the minimum distance:

ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(6, 1, 8, 4, 5, 11))

data <- merge(ev2, ev1, by=c("test_id"), suffixes=c(".ev2", ".ev1"))

data$distance <- data$time.ev2 - data$time.ev1

min_data <- data %>%
  group_by(test_id, time.ev2) %>%
  filter(abs(distance) == min(abs(distance)))

While this works, the merge part is very slow and feels inefficient -- I'm generating a huge table with all combinations of ev2->ev1 for the same test_id, only to filter it down to one. It seems like there should be a way to "filter on the fly", during the merge. Is there?

Update: The following case with two "group by" columns fails when data.table approach outlined by akrun is used:

ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4), group_id=c(0, 0, 0, 1, 1, 1))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(5, 6, 7, 1, 2, 8), group_id=c(0, 0, 0, 1, 1, 1))
setkey(setDT(ev1), test_id, group_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=abs(time-i.time)]

Error in eval(expr, envir, enclos) : object 'i.time' not found

have you tried `left_join(ev2,ev1, by=c("test_id") )` instead of `merge`? — Khashaa, Dec 14 '14 at 04:53
@Khashaa No, I didn't. I tried it now, and it's infinitely faster. Is it just a much better implementation of the same thing, or is something else going on? — Stan, Dec 14 '14 at 05:32
@Stan Have you tried the `data.table` approach. It should be very fast. — akrun, Dec 14 '14 at 05:37

score 6 · Accepted Answer · edited Oct 30 '17 at 21:46

Here's how I'd do it using data.table:

require(data.table)
setkey(setDT(ev1), test_id)
ev1[ev2, .(ev2.time = i.time, ev1.time = time[which.min(abs(i.time - time))]), by = .EACHI]
#    test_id ev2.time ev1.time
# 1:       0        6        3
# 2:       0        1        1
# 3:       0        8        3
# 4:       1        4        4
# 5:       1        5        4
# 6:       1       11        4

In joins of the form x[i] in data.table, the prefix i. is used to refer the columns in i, when both x and i share the same name for a particular column.

Please see this SO post for an explanation on how this works.

This is syntactically more straightforward to understand what's going on, and is memory efficient (at the expense of little speed¹) as it doesn't materialise the entire join result at all. In fact, this does exactly what you say in your post - filter on the fly, while merging.

On speed, it doesn't matter in most of the cases really. If there are a lot of rows in i, it might be a tad slower as the j-expression will have to be evaluated for each row in i. In contrast, @akrun's answer does a cartesian join followed by one filtering. So while it's high on memory, it doesn't evaluate j for each row in i. But again, this shouldn't even matter unless you work with really large i which is not often the case.

HTH

This is perfect -- like you say, exactly what I was hoping to find, something that doesn't create the huge join to begin with. When you say 'at the expense of little speed', what do you mean? What makes this a bit slower than creating the full join and then filtering? On my big testcase, it seemed to be even a little bit faster. — Stan, Dec 14 '14 at 14:26
@Stan, awesome! Glad to hear it helped. Edited to clarify your Q in the post. — Arun, Dec 14 '14 at 14:37

akrun · Answer 2 · 2014-12-14T06:13:24.517

0

May be this helps:

library(data.table)
setkey(setDT(ev1), test_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=time-i.time]
DT[DT[,abs(distance)==min(abs(distance)), by=list(test_id, i.time)]$V1]
#    test_id time i.time distance
#1:       0    3      6        3
#2:       0    1      1        0
#3:       0    3      8        5
#4:       1    4      4        0
#5:       1    4      5        1
#6:       1    4     11        7

Or

 ev1[ev2, allow.cartesian=TRUE][,distance:= time-i.time][,
      .SD[abs(distance)==min(abs(distance))], by=list(test_id, i.time)]

Update

Using the new grouping

setkey(setDT(ev1), test_id, group_id)
setkey(setDT(ev2), test_id, group_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=i.time-time]
DT[DT[,abs(distance)==min(abs(distance)), by=list(test_id, 
                                group_id,i.time)]$V1]$distance
#[1]  2  3  4 -1  0  4

Based on the code you provided

min_data$distance
#[1]  2  3  4 -1  0  4

edited Dec 14 '14 at 06:13

answered Dec 14 '14 at 05:08

akrun

874,273
37
540
662

I'm not familiar with data.table, so I'm going to have to look at this more closely to understand it :). One thing I had to change is to move abs() to the second line - abs(distance) == min(abs(distance)), so that I can still get negative distances (i.e. ev2 happened after ev1). – Stan Dec 14 '14 at 05:45
One question -- my real-life example had two columns (let's call them test_id and group_id) that together formed the grouping key. How would I change the code above to make that happen? I tried setkey(setDT(ev1), test_id, group_id), which went through, but then I got an error on the next line that object 'i.time' is not found, which puzzled me. – Stan Dec 14 '14 at 05:45
@Stan Yes, you can move that to the second line. I was only looking at the `min_data` as reference. Could you show an example that gives the error (makes it easier to look)? – akrun Dec 14 '14 at 05:47
I have added an update with an example that gives me the error. – Stan Dec 14 '14 at 05:55
@Stan You can use `setkey(setDT(ev2), test_id, group_id)` and then try the steps with `by=list(test_id, group_id, i.time)]` – akrun Dec 14 '14 at 06:09
I don't get that far, though, the error happens on the "DT <- ... " line. After a bit of experimenting, the order of columns seems to matter. If group_id right after test_id, then your code works. If it's at the end (as in the example above), then it doesn't. – Stan Dec 14 '14 at 06:14
@Stan Please show an example that mimics your dataset. From testing the code with the code that you showed, it is giving the same result. – akrun Dec 14 '14 at 06:15
@Stan I tried both ways and it works `setkey(setDT(ev1), group_id, test_id); setkey(setDT(ev2), group_id, test_id); ev1[ev2, allow.cartesian=TRUE][,distance:=i.time-time][] test_id time group_id i.time distance 1: 0 1 0 5 4 2: 0 2 0 5 3` – akrun Dec 14 '14 at 06:17
I didn't have the 'setkey(setDT(ev2))' line when I was testing -- it wasn't there in your first example. However, once I've put it in, the code does work, and it is very fast. Thank you! – Stan Dec 14 '14 at 06:21

Find the minimum distance between two data frames, for each element in the second data frame

2 Answers2

Update

Linked