
I want to merge two data frames in data.table by a shared ID. The first data frame, df1, has unique IDs; the second, df2, has duplicated IDs. For each ID I only want to keep the df2 row whose date is closest to the date in df1; every other row in df2 should be dropped. I tried the nearestTimeandID function from the bayesbio package, but it takes too much time. I also tried roll = "nearest" in data.table, following another comment, but the IDs are still duplicated.

This is how I called nearestTimeandID:

library(bayesbio)
library(data.table)

together <- nearestTimeandID(df1, df2, timeCol1 = "Date1", timeCol2 = "Date2", IDcol = "ID")

and this is the roll = "nearest" attempt:

setDT(df1)[, join_date := Date1] 
setDT(df2)[, join_date := Date2] 
together <- df1[df2, on = .(ID, join_date), roll = "nearest"] 

How can I merge the datasets, keeping only the df2 row with the closest date per ID and dropping the rest, in a time-efficient way? Thanks!


Sample data:

d <- as.Date("2020-07-10")
df1 <- data.table(ID=1, Date1=d, Val1=11)
df2 <- data.table(ID=rep(1, 2), Date2=c(d-3, d+1), Val2=21:22)
Svenja
  • Earlier I answered a similar question: https://stackoverflow.com/a/63194230/3358272. Otherwise, questions on SO do much better if they are reproducible and self-contained. By that I mean including attempted code (please be explicit about non-base packages), sample representative data (perhaps via `dput(head(x))` or building data programmatically (e.g., `data.frame(...)`), possibly stochastically after `set.seed(1)`), perhaps actual output (with verbatim errors/warnings) versus intended output. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Jul 31 '20 at 19:57
  • try something like `df2[, rn := .I]; df2[unique(df2[df1, on=.(ID, Date2=Date1), roll="nearest", rn])]` (expanded into a runnable sketch after these comments). If two dates are equally close, what would you do with them? – chinsoon12 Aug 01 '20 at 23:35
  • In that case I always want `Date1 > Date2`. – Svenja Aug 03 '20 at 23:49
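
For reference, chinsoon12's suggestion can be expanded into a runnable sketch (my reconstruction using the sample data above, so treat it as an assumption rather than a tested answer). The attempted join above keeps duplicates because x[i] in data.table returns one row per row of i, so df1[df2, ...] yields one row per df2 row. With df1 as the i table instead, the join returns exactly one row per df1 row:

library(data.table)

d <- as.Date("2020-07-10")
df1 <- data.table(ID = 1, Date1 = d, Val1 = 11)
df2 <- data.table(ID = rep(1, 2), Date2 = c(d - 3, d + 1), Val2 = 21:22)

# tag each df2 row so its original Date2/Val2 can be recovered after the roll
df2[, rn := .I]

# df1 is the i table, so the join returns one row per df1 row;
# roll = "nearest" picks the df2 date closest to each Date1
idx <- unique(df2[df1, on = .(ID, Date2 = Date1), roll = "nearest", rn])

# pull the selected df2 rows and merge them back onto df1 by ID
together <- df1[df2[idx], on = .(ID)]
together
#    ID      Date1 Val1      Date2 Val2 rn
# 1:  1 2020-07-10   11 2020-07-11   22  2

If, per the follow-up comment, only earlier df2 dates should qualify, replacing roll = "nearest" with roll = Inf should match the most recent Date2 on or before each Date1 (last observation carried forward); check ?data.table for the exact boundary behaviour if Date2 must be strictly earlier.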

1 Answer


This is a tidyverse answer, but perhaps you can apply the same logic to data.table. The key is to find the closest datetime only after you've already joined. First, the two datasets: df1 has 10 unique ids, while df2 is 100,000 rows long with repeated ids. The datetimes are randomly sampled within a 200-day window around the current time:

library(tidyverse)

df1 <- data.frame(
  id = 1:10,
  datetime = Sys.time() + runif(10, -60*60*24*100, 60*60*24*100)
)

df2 <- data.frame(
  id = sample(1:10, 100000, replace = TRUE),
  datetime = Sys.time() + runif(100000, -60*60*24*100, 60*60*24*100)
)

Now we join on id, then group_by it in the resulting data frame, and filter for the rows with the minimal distance between the two datetime columns:

df1 %>%
  inner_join(df2, by = 'id', suffix = c('.1', '.2')) %>%
  group_by(id) %>%
  filter(abs(datetime.1 - datetime.2) == min(abs(datetime.1 - datetime.2)))

# A tibble: 10 x 3
# Groups:   id [10]
      id datetime.1          datetime.2         
   <int> <dttm>              <dttm>             
 1     1 2020-07-28 15:10:56 2020-07-28 15:18:37
 2     2 2020-10-26 03:33:34 2020-10-26 03:34:02
 3     3 2020-09-03 08:50:15 2020-09-03 08:52:15
 4     4 2020-08-01 07:33:25 2020-08-01 06:55:01
 5     5 2020-08-24 15:09:55 2020-08-24 15:07:50
 6     6 2020-09-18 02:16:17 2020-09-18 02:25:16
 7     7 2020-10-19 11:31:33 2020-10-19 12:28:08
 8     8 2020-10-06 07:42:08 2020-10-06 07:42:41
 9     9 2020-10-21 14:20:12 2020-10-21 14:28:04
10    10 2020-04-24 17:17:59 2020-04-24 17:08:51
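
A possible data.table translation of the same join-then-filter logic, as a sketch (assuming the df1 and df2 built above; this is not part of the original answer):

library(data.table)

setDT(df1)
setDT(df2)

# join every df2 row to its unique df1 row, then keep, per id,
# the single row whose two datetimes are closest together
joined <- df1[df2, on = .(id)]   # columns: id, datetime (from df1), i.datetime (from df2)
nearest <- joined[, .SD[which.min(abs(datetime - i.datetime))], by = id]

One difference from the dplyr version: which.min() keeps a single row even when two dates tie for closest, whereas filter(... == min(...)) keeps all tied rows.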

Walker Harrison