Find the closest date between dataset1 and dataset2

Question

I have two datasets. One that is collected approximately every 5 days and the other is collected every 15 minutes daily. I want a final list that matches the closest date from the less frequent dataset to the entry in the more frequent one.

For example:

satDat <- c('2015-04-16', '2015-04-21', '2012-04-26') # collected every 5 days

stationDat <- sort(rep(seq(as.Date("2015-04-01"), as.Date("2015-04-20"), by='day'),2)) 
#collected multiple times a day

 [1] "2015-04-01" "2015-04-01" "2015-04-02" "2015-04-02" "2015-04-03"
 [6] "2015-04-03" "2015-04-04" "2015-04-04" "2015-04-05" "2015-04-05"
[11] "2015-04-06" "2015-04-06" "2015-04-07" "2015-04-07" "2015-04-08"
[16] "2015-04-08" "2015-04-09" "2015-04-09" "2015-04-10" "2015-04-10"
[21] "2015-04-11" "2015-04-11" "2015-04-12" "2015-04-12" "2015-04-13"
[26] "2015-04-13" "2015-04-14" "2015-04-14" "2015-04-15" "2015-04-15"
[31] "2015-04-16" "2015-04-16" "2015-04-17" "2015-04-17" "2015-04-18"
[36] "2015-04-18" "2015-04-19" "2015-04-19" "2015-04-20" "2015-04-20"

I want my results to look like this

[1] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[6] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[11] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[16] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" 
[21] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[26] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[31] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[36] "2015-04-16" "2015-04-21" "2015-04-21" "2015-04-21" "2015-04-21"

Does the answer in this [question](http://stackoverflow.com/questions/31103897/matching-timestamped-data-to-closest-time-in-another-dataset-properly-vectorize) help? — conrad-mac, Feb 12 '17 at 04:08
`satDat[apply(abs(outer(satDat, stationDat, difftime, units = 'days')), 2, which.min)]`, though I suspect there's a more elegant option — alistaire, Feb 12 '17 at 04:24
Since one of the vectors is sorted, you can take advantage of `cut` to narrow the possible dates to check so that @alistaire's solution isn't so explosive. (It'd be more code, certainly, but very useful if your data is big enough.) — r2evans, Feb 12 '17 at 04:41
@alistaire I like your solution, however, once you change the stationDat to a longer time series say stationDat <- sort(rep(seq(as.Date("2015-04-01"), as.Date("2015-05-26"), by='day'),2)) The last numbers are just 2015-04-21 where I need them to be 2015-04-26. Can you explain what you did? — Christiana A, Feb 12 '17 at 05:43
2015-04-26 isn't in `satDat`; it has 201**2**-04-26, which will be far from everything...typo? Anyway, `outer` compares every combination of two vectors with the function supplied, here `difftime`, and returns a matrix, on which `abs` drops negatives. `apply` iterates over the columns (margin 2) and applies `which.min`, which returns the index of the smallest, which are used to subset `satDat`. If it was a typo, I can add it as an answer with a full explanation, as no alternative seems forthcoming. — alistaire, Feb 12 '17 at 06:14
@alistaire Thank you for spotting that. Yes, it was a typo. Thanks for your explanation. — Christiana A, Feb 12 '17 at 17:32

score 0 · Answer 1 · answered Feb 12 '17 at 10:44

The rolling join offered by package data.table comes to mind.

library(data.table)
DT1 <- data.table(date = as.Date(satDat), date1 = as.Date(satDat))
DT2 <- data.table(date = stationDat)

DT1[DT2, date1, roll = "nearest", on = .(date)]
# [1] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
# [7] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[13] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[19] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[25] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[31] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[37] "2015-04-21" "2015-04-21" "2015-04-21" "2015-04-21"

It's probably also useful for whatever your actual task is, because I suspect it goes beyond this.

score 0 · Accepted Answer · answered Feb 13 '17 at 00:51

An option using outer:

satDat[apply(abs(outer(satDat, stationDat, difftime, units = 'days')), 2, which.min)]

#>  [1] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#>  [6] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [11] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [16] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [21] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [26] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [31] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [36] "2015-04-16" "2015-04-21" "2015-04-21" "2015-04-21" "2015-04-21"

How it works:

outer takes applies difftime to each pair of elements in the two vectors, returning a matrix,
over which apply iterates over the columns (MARGIN = 2), calling which.min on each, which returns the index of the smallest,
which is used to subset satDat.

Note that outer allocates a matrix with dimensions of length(satDat) by length(stationDat), which can require a lot of memory if your data is already large.

Find the closest date between dataset1 and dataset2

2 Answers2