8

I want to join two tables xxx and yyy using a composite unique key and date ranges. In sql I would simply specify in the join but I cannot get dplyr to work. Is this possible?

test<- inner_join(xxx, yyy, by = c("ID" = "ID",
                               "NRA"="NRA",
                               "date_low">"date",
                               "date_high"<"date"),
                               copy = FALSE)
J. Doe.
  • 1,255
  • 1
  • 12
  • 25
  • Can you provide input data and expected output? – user1357015 Aug 17 '18 at 16:16
  • 1
    Use `fuzzy_join` and then use `matches = c('=', '=', ">", "<")` – akrun Aug 17 '18 at 16:18
  • @user1357015 I cannot. The actual data is restricted. Example data is difficult to construct. `xxx.date_low` would almost always be lower than `yyy.date` which will almost always be higher than `xxx.date_high`. It's the functionality that I am after. – J. Doe. Aug 17 '18 at 16:24
  • If you can't share this *exact* dataset, you can find a commonly available dataset that replicates the issue, or make up some data. Without a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), it's difficult to do more than guess as to how to fix the problem – camille Aug 17 '18 at 16:52

2 Answers2

8

We could use fuzzy_inner_join from fuzzy_join

library(fuzzy_join)
fuzzy_inner_join(xxx, yyy,
              by = c("ID" = "ID",
                           "NRA"="NRA",
                           "date_low" =  "date",
                           "date_high" = "date"), 
              match_fun = list("==", "==", ">", "<"))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you akrun. I did just that and I got this error https://stackoverflow.com/questions/44383510/r-passing-arguments-into-the-match-fun-function-in-fuzzyjoinfuzzy-join – J. Doe. Aug 17 '18 at 16:30
  • @J.Doe. Without a reproducible example, it is difficult. The code is standard syntax – akrun Aug 17 '18 at 16:33
  • 8
    @akrun It seems some people have found that using backticks instead of quotes in the match_fun list will make it work. – Arthur Yip Mar 19 '19 at 06:33
8

First of all, thank you for trying to help me. I realize my question is incomplete. I moved away from fuzzyjoin because of all the bioconductor dependencies.

I used sqldf instead to accomplish the task:

library(sqldf)
sqldf("SELECT * FROM xxx
            LEFT JOIN yyy
            ON  xxx.ID  = yyy.ID
            AND xxx.NRA = yyy.NRA
            AND yyy.date BETWEEN xxx.date_low AND xxx.date_high")

The result is almost identical to this question but I suspect it can also be solved with that question as per Uwe's data.table solution.

I am also linking this rstudio response

J. Doe.
  • 1,255
  • 1
  • 12
  • 25