Join on two columns/keys, with margin

Question

For accident-analysis, I have to check if logged accidents from one system, exist in the logs from another system. Problem is that both systems are filled manually, so (small?) differences in location and time may occur.

For now, I've got this problem solved with a function, which I call with:

sys1log.df["match_1_900"] <- apply(sys1log.df, 1, bestMatch, marginLocation = 1, marginTime = 900)

marginLocation is the margin I want to use for the location of an incident. In this case the margin is 1, so all incidents in syslog2.df, which are logged between 0 and 2 are possible candidates for a match. The same goes for marginTime, in this example set to 900 seconds. All incidents from syslog2.df which are logged between a quarter of an hour before (or after) the incident from syslog1.df, are possible matches. The only thing I want to match 'hard' is the roadnumber.

The function bestMatch is:

bestMatch <- function (x, marginLocation, marginTime) {
  location <- as.numeric( x[10] )                                
  roadnumber  <- as.numeric( x[9] )                                 
  time <- as.POSIXct( strptime(x[4], "%Y-%m-%d %H:%M:%S") )  

  require("dplyr")
  df <- sys2log.df %>%
    #filter rows that match criteria (within margins)
    filter(road == roadnumber, 
           loc < location + marginLocation, 
           loc > location - marginLocation, 
           starttime < time + marginTime, 
           starttime > time - marginTime) %>%
    #create column with absolute difference between time system1 and time system2
    mutate(timeDifference = abs( as.numeric(time) - as.numeric(starttime) )) %>%
    #sort on timeDifference
    arrange(timeDifference)
    #if a match is found, return the value in column 15 from the row with the smallest timeDifference)
    if (length(df)) {
      return(df[1,15])
    } else {
      return(NA)
    }
}

This works fine, but the problem is that the logs contain >100.000 rows, so the apply-function takes about 15-30 minutes to run. I'm using multiple combination of location/time-margins, so I would really like to speed up things.

I think this can be done (much) faster, using data.table's rolling joins. My "problem" is that I would like to join on three keys, of which two should contain a rolling window/margin. Data.table only lets you apply a rolling join on one (the last) key.

I'm sure there is a way to achieve my goal with data.table (or another package), but I'm lost. Who can point me in the right direction?

See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 regarding making a minimal complete reproducible example. — Frank, Sep 13 '17 at 12:01

score 0 · Accepted Answer · edited Sep 13 '17 at 13:39

It's typically a situation where you shouldn't use apply, you're converting your data.frame to a matrix then at each iteration reconverting every value.

use purrr::pmap instead to iterate on the chosen columns.

Don't sort your data when you're only looking for a minimum value only, use which.min, (and keep only the first result in case of multiple solutions).

Your test on length(df)) is counting the columns of the data.frame so it will never fail, I think you meant to test for nrows. I just skipped it as you can just test afterwards what object you received.

As you don't provide a reproducible example I can't guarantee that it works as I'm a lousy blind coder :). But it should point you to the solution.

# I'm supposing that the indices 10 9 and 4 are for loc, road, and starttime, and that in the original format the columns are well formatted

get_new_col <- function(marginLocation = 1, marginTime = 900){
sys1log.df["match_1_900"] <- sys1log.df %>% select(loc,road,starttime) %>%
  pmap(function(location,road_number,time){
    filter(sys1log.df %>%
             filter(road == roadnumber, 
                   loc < location + marginLocation, 
                   loc > location - marginLocation, 
                   starttime < time + marginTime, 
                   starttime > time - marginTime) %>%
             %>% {.[which.min(abs(time-starttime))[1],"timeDifference"]}
  }
}

sys1log.df["match_1_900"] <- get_new_col()

Not sure how it happened, but I completely missed your reply. My apologies! I will have a look at your solution this afternoon. Short question: In your code, I never see sys2log.df being used... typo? — Wimpel, Oct 06 '17 at 09:32

Join on two columns/keys, with margin

1 Answers1