24

I have a timestamp in one data frame that I am trying to match to the closest timestamp in a second dataframe, for the purpose of extracting data from the second dataframe. See below for a generic example of my approach:

library(lubridate)

data <- data.frame(datetime=ymd_hms(c('2015-04-01 12:23:00 UTC', '2015-04-01 13:49:00 UTC', '2015-04-01 14:06:00 UTC' ,'2015-04-01 14:49:00 UTC')),
                   value=c(1,2,3,4))
reference <- data.frame(datetime=ymd_hms(c('2015-04-01 12:00:00 UTC', '2015-04-01 13:00:00 UTC', '2015-04-01 14:00:00 UTC' ,'2015-04-01 15:00:00 UTC', '2015-04-01 16:00:00 UTC')),
                        refvalue=c(5,6,7,8,9))

data$refvalue <- apply(data, 1, function (x){
  differences <- abs(as.numeric(difftime(ymd_hms(x['datetime']), reference$datetime)))
  mindiff <- min(differences)
  return(reference$refvalue[differences == mindiff])
})

data
#              datetime value refvalue
# 1 2015-04-01 12:23:00     1        5
# 2 2015-04-01 13:49:00     2        7
# 3 2015-04-01 14:06:00     3        7
# 4 2015-04-01 14:49:00     4        8

This works fine, except it is very slow, because the reference dataframe is quite large in my real-world application. Is this code properly vectorized? Is there a faster, more elegant way of performing this operation?

IRTFM
  • 258,963
  • 21
  • 364
  • 487
user278411
  • 455
  • 3
  • 8
  • In Python this is exactly what the function numpy.searchsorted is for. I looked for an R equivalent but can't find one so far .. – cxrodgers Jun 28 '15 at 19:35
  • @cxrodgers: It would be interesting to see an application of that function to this task. After looking at tis documentation, I'm scratching my head wondering how it would do this. Can you find an SO working example? – IRTFM Jun 28 '15 at 21:24
  • @BondedDust see the answer provided by Bi Rico here: http://stackoverflow.com/questions/8914491/finding-the-nearest-value-and-return-the-index-of-array-in-python – cxrodgers Jun 30 '15 at 02:24

2 Answers2

20

You can try data.tables rolling join using the "nearest" option

library(data.table) # v1.9.6+
setDT(reference)[data, refvalue, roll = "nearest", on = "datetime"]
# [1] 5 7 7 8
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 1
    In one line (without obtaining index): `setDT(data)[, refvalue := setkey(setDT(reference), datetime)[data, refvalue, roll="nearest"]]` - this'll get much nicer when joins with no keys is implemented. – Arun Jun 28 '15 at 20:57
  • 1
    Is it possible to also add the datetime used from reference and add that to data, for checking how close "nearest" actually is? – BetaScoo8 Feb 22 '16 at 21:13
  • 2
    @BetaScoo8 you could do something like `indx <- setkey(setDT(reference), datetime)[data, roll = "nearest", which = TRUE] ; setDT(data)[, c("refdate", "refvalue") := reference[indx]]` – David Arenburg Feb 22 '16 at 22:10
  • @BetaScoo8 Btw, you don't need to key anymore, you can now create `indx` using `indx <- setDT(reference)[data, roll = "nearest", which = TRUE, on = "datetime"]` (in v 1.9.6+) – David Arenburg Feb 22 '16 at 22:34
  • I initially selected the answer from 42-, but this is the approach I have ended up using on a daily basis. Both work though. data.tables are great for working with large data – user278411 Jun 15 '18 at 16:50
19

I wondered if this would be able to match a data.table solution for speed, but it's a base-R vectorized solution which should outperform your apply version. And since it doesn't actually ever calculate a distance, it might actually be faster than the data.table-nearest approach. This adds the length of the midpoints of the intervals to either the lowest possible value or the starting point of the the intervals to create a set of "mid-breaks" and then uses the findInterval function to process the times. That creates a suitable index into the rows of the reference dataset and the "refvalue" can then be "transferred" to the data-object.

 data$reefvalue <- reference$refvalue[
                      findInterval( data$datetime, 
                                     c(-Inf, head(reference$datetime,-1))+
                                     c(0, diff(as.numeric(reference$datetime))/2 )) ]
 # values are [1] 5 7 7 8
IRTFM
  • 258,963
  • 21
  • 364
  • 487