0

I have two datasets that I would like to merge using inner_join in R. The problem is that the second dataset contains dates ranges, and I would like to keep this information. How can I match the date of the 1st dataset with the date-range of the 2nd dataset? Below a working example.

Thank you very much.

library(data.table)
library(dplyr)

# First Dataset
dt_1 <- data.table()
dt_1$city <- c("madrid","milan","milan","paris", "Rome")
dt_1$address <- c("a","a","b","c","d")
dt_1$date_1 <- c( "2017", "2013", "2008", "1901","2009")
dt_1


# Second dataset
dt_2 <- data.table()
dt_2$city <- c("milan","madrid","Porto","Barcelona", "Rome")
dt_2$address <- c("a","a","b","c","d")
dt_2$date_1 <- c( "2012", "2016", "2006", "1900","2009")
dt_2$date_2 <- c( "2015", "NA", "2022", "1930","NA")
dt_2


## How to match the corresponding exact dates of the two datasets BUT ALSO the dates falling -
## in the ranges

# This keeps only if the first date is the same
dt_match <- inner_join(dt_1, dt_2, by = c("city","address","date_1"), keep = TRUE)


# How to achieve this ? 
dt_match <- data.table()
dt_match$city <- c("milan","Rome")
dt_match$address <- c("a","c")
dt_match$date <- c( "2013","2009")
dt_match
ab_ed
  • 25
  • 4
  • Does this help? https://stackoverflow.com/questions/27487949/how-to-perform-join-over-date-ranges-using-data-table – Marco_CH Jan 13 '22 at 10:58
  • (1) You need to be using numbers, not strings, for your values. Why? `"222" >= "2017"` is true due to its lexicographic sorting. Fix with `as.integer` (or `as.numeric`). (2) This is not a join: if it were, then `dt_1[2,]` would find a match in `dt_2[3,]`. It appears that Waldi is correct, this is just a row-wise comparison. (3) If it *were* a join, though, after fixing numbers a start could be `dt_1[dt_2, date_2 := i.date_2, on = .(date_1 >= date_1, date_1 <= date_2)]`. – r2evans Jan 13 '22 at 12:43

2 Answers2

1

According to the expected result, looks like a row-wise comparison:

setDT(dt_1)
setDT(dt_2)
dt <- cbind(dt_1,dt_2)
setnames(dt,c("date_1","start","end"))
dt[,end:= fifelse(end=="NA",start,end)]
dt[,.(date_1=fifelse(date_1 %between% .(start,end),date_1,NA))]

   date_1
   <char>
1:   2013
2:   <NA>
3:   2008
4:   1901
5:   2009

If this isn't a row-wise comparison, 2017 should belong to the result because it belongs to range 2006-2022 (dt_2[3])

Waldi
  • 39,242
  • 6
  • 30
  • 78
  • Thank you very much for your answer, I edited the question because it seemed like a row wise comparison but in reality I would like to achieve a match based also on another two columns. – ab_ed Jan 13 '22 at 16:28
  • Then @r2evans answer in comments is the right one ;-) – Waldi Jan 13 '22 at 16:50
0

Thanks for your suggestion, I adjusted based on your codes, the solution I used is not elegant at all but it should work:

dt_3 <- inner_join(dt_1, dt_2, by = c("city","address"), keep = TRUE)

dt_3$found_range <- ifelse(sapply(seq_along(dt_3$date_1), function(i) {
  inds <- dt_3$date_2 <= dt_3$date_1[i] & dt_3$date_3 >= dt_3$date_1[i]
  any(inds) & (dt_3$city[i] == dt_3$city[which.max(inds)])
}), 1, NA)
dt_3

dt_3$found_date <- ifelse(sapply(seq_along(dt_3$date_1), function(i) {
  inds <- dt_3$date_2 == dt_3$date_1[i]
  any(inds) & (dt_3$city[i] == dt_3$city[which.max(inds)])
}), 1, NA)

dt_3

dt_3$found_comb <- rowSums(dt_3[,c("found_range", "found_date")], na.rm=TRUE)
dt_3
dt_3[!dt_3$found_comb == 0]
ab_ed
  • 25
  • 4