Does anyone know whether the dplyr extension packages (dbplyr and dtplyr) allow non-equi joins within the usual dplyr workflow? I rarely need data.table, but fast non-equi joins are the one situation where I always have to `setDT()`, perform the join, and then switch back with `as_tibble()`. I scrolled through the packages' issues on GitHub but didn't find whether this is planned or already implemented.
b_surial
- Interesting, but not yet pushed forward: https://github.com/tidyverse/dtplyr/issues/133 – Waldi Feb 25 '21 at 17:40
- https://www.r-bloggers.com/2021/02/the-unequalled-joy-of-non-equi-joins/ – Waldi Feb 25 '21 at 17:48
- You might consider the `fuzzyjoin` package. – Jon Spring Feb 25 '21 at 19:15
- Inner join on means cross join where. Cross join means inner join on true. – philipxy Mar 01 '21 at 02:54
3 Answers
Non-equi joins are available since dplyr 1.1.0 via the function `join_by()`. To create non-equi joins, you can use `<`, `>`, `>=`, `<=`, or the helpers `between()`, `within()`, `overlaps()`, and `nearest()`.
library(dplyr)
# Example from https://github.com/tidyverse/dplyr/pull/5910
set.seed(123)
dates <- as.Date("2019-01-01") + 0:4
needles <- tibble(dates = dates, x = sample(length(dates)))
set.seed(123)
lower <- as.Date("2019-01-01") + sample(6, 5, replace = TRUE)
upper <- lower + sample(2, 5, replace = TRUE)
haystack <- tibble(lower = lower, upper = upper, y = sample(length(lower)))
needles
#> # A tibble: 5 x 2
#> dates x
#> <date> <int>
#> 1 2019-01-01 3
#> 2 2019-01-02 2
#> 3 2019-01-03 5
#> 4 2019-01-04 4
#> 5 2019-01-05 1
haystack
#> # A tibble: 5 x 3
#> lower upper y
#> <date> <date> <int>
#> 1 2019-01-04 2019-01-06 1
#> 2 2019-01-07 2019-01-08 2
#> 3 2019-01-04 2019-01-05 3
#> 4 2019-01-03 2019-01-05 4
#> 5 2019-01-03 2019-01-05 5
# Non-equi join
# For each row in `needles`, find locations in `haystack` matching the condition
left_join(needles, haystack, by = join_by(dates >= lower, dates <= upper))
#> # A tibble: 12 x 5
#> dates x lower upper y
#> <date> <int> <date> <date> <int>
#> 1 2019-01-01 3 NA NA NA
#> 2 2019-01-02 2 NA NA NA
#> 3 2019-01-03 5 2019-01-03 2019-01-05 4
#> 4 2019-01-03 5 2019-01-03 2019-01-05 5
#> 5 2019-01-04 4 2019-01-04 2019-01-06 1
#> 6 2019-01-04 4 2019-01-04 2019-01-05 3
#> 7 2019-01-04 4 2019-01-03 2019-01-05 4
#> 8 2019-01-04 4 2019-01-03 2019-01-05 5
#> 9 2019-01-05 1 2019-01-04 2019-01-06 1
#> 10 2019-01-05 1 2019-01-04 2019-01-05 3
#> 11 2019-01-05 1 2019-01-03 2019-01-05 4
#> 12 2019-01-05 1 2019-01-03 2019-01-05 5
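The pair of inequalities above can also be expressed with the `between()` helper, which is shorthand for `dates >= lower, dates <= upper` (again assuming dplyr >= 1.1.0):

```r
library(dplyr)

# Rebuild the same example data as above
set.seed(123)
dates <- as.Date("2019-01-01") + 0:4
needles <- tibble(dates = dates, x = sample(length(dates)))
set.seed(123)
lower <- as.Date("2019-01-01") + sample(6, 5, replace = TRUE)
upper <- lower + sample(2, 5, replace = TRUE)
haystack <- tibble(lower = lower, upper = upper, y = sample(length(lower)))

# between(x, y_lower, y_upper) is shorthand for x >= y_lower, x <= y_upper,
# so this returns the same 12 rows as the join above
left_join(needles, haystack, by = join_by(between(dates, lower, upper)))
```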

Maël
There has been a new option for this in dbplyr since version 1.4.0: `sql_on`. Citing Kirill Müller:

There is #2240 for dplyr, but it's going to take a while. For databases we already have a workaround [i.e. generic SQL joins].
library(dplyr)
library(dbplyr)
tbl1 <- memdb_frame(a = 1:3, b = 4:2)
tbl2 <- memdb_frame(c = 1:3, b = 2:0)
left_join(tbl1, tbl2, sql_on = "LHS.b < RHS.c")
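The SQL that dbplyr generates for the custom join condition can be inspected with `show_query()` (a standard dbplyr function), which is a quick way to verify that the condition lands in the `ON` clause:

```r
library(dplyr)
library(dbplyr)

tbl1 <- memdb_frame(a = 1:3, b = 4:2)
tbl2 <- memdb_frame(c = 1:3, b = 2:0)

# show_query() prints the translated SQL instead of executing the join
left_join(tbl1, tbl2, sql_on = "LHS.b < RHS.c") %>% show_query()
```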

Jarosław Nirski
For dbplyr: while SQL supports non-equi joins, I have not found an equivalent dplyr approach. My usual workaround, very similar to the r-bloggers link posted by @Waldi, is to join on the equality conditions and then filter on the inequality conditions.
For example:
output <- inner_join(df1, df2, by = c("df1_id" = "df2_id")) %>%
  filter(df1_date <= df2_date)
This translates to SQL similar to:
SELECT *
FROM df1
JOIN df2
ON df1_id = df2_id
WHERE df1_date <= df2_date
Which is not that different from:
SELECT *
FROM df1
JOIN df2
ON df1_id = df2_id
AND df1_date <= df2_date

Simon.S.A.
- "Which is not that different from:" Thanks for the answer. Are there cases when the two differ? – dimid Feb 23 '22 at 09:50
- There may be some cases when using a left, right, or outer join that the two return different output. Suppose we used a left join and there was a record where `df1_date` was `NULL` - in an inequality join this record would be part of the output with nothing joined to it, but using the WHERE clause this record might be dropped. However, in these kinds of cases you probably want to handle `NULL` values explicitly. – Simon.S.A. Feb 23 '22 at 19:54