For accident-analysis, I have to check if logged accidents from one system, exist in the logs from another system. Problem is that both systems are filled manually, so (small?) differences in location and time may occur.
For now, I've got this problem solved with a function, which I call with:
sys1log.df["match_1_900"] <- apply(sys1log.df, 1, bestMatch, marginLocation = 1, marginTime = 900)
marginLocation is the margin I want to use for the location of an incident. In this case the margin is 1, so all incidents in syslog2.df, which are logged between 0 and 2 are possible candidates for a match. The same goes for marginTime, in this example set to 900 seconds. All incidents from syslog2.df which are logged between a quarter of an hour before (or after) the incident from syslog1.df, are possible matches. The only thing I want to match 'hard' is the roadnumber.
The function bestMatch is:
bestMatch <- function (x, marginLocation, marginTime) {
location <- as.numeric( x[10] )
roadnumber <- as.numeric( x[9] )
time <- as.POSIXct( strptime(x[4], "%Y-%m-%d %H:%M:%S") )
require("dplyr")
df <- sys2log.df %>%
#filter rows that match criteria (within margins)
filter(road == roadnumber,
loc < location + marginLocation,
loc > location - marginLocation,
starttime < time + marginTime,
starttime > time - marginTime) %>%
#create column with absolute difference between time system1 and time system2
mutate(timeDifference = abs( as.numeric(time) - as.numeric(starttime) )) %>%
#sort on timeDifference
arrange(timeDifference)
#if a match is found, return the value in column 15 from the row with the smallest timeDifference)
if (length(df)) {
return(df[1,15])
} else {
return(NA)
}
}
This works fine, but the problem is that the logs contain >100.000 rows, so the apply-function takes about 15-30 minutes to run. I'm using multiple combination of location/time-margins, so I would really like to speed up things.
I think this can be done (much) faster, using data.table's rolling joins. My "problem" is that I would like to join on three keys, of which two should contain a rolling window/margin. Data.table only lets you apply a rolling join on one (the last) key.
I'm sure there is a way to achieve my goal with data.table (or another package), but I'm lost. Who can point me in the right direction?