Does anyone know whether the dplyr extension packages (dbplyr and dtplyr) allow non-equi joins within the usual dplyr workflow? I rarely need data.table, but fast non-equi joins are the one situation where I always have to `setDT()`, perform the join, and then switch back with `as_tibble()`. I scrolled through the packages' issues on GitHub but didn't find whether this is planned or already implemented.
b_surial
- Interesting, but not yet pushed forward: https://github.com/tidyverse/dtplyr/issues/133 – Waldi Feb 25 '21 at 17:40
- https://www.r-bloggers.com/2021/02/the-unequalled-joy-of-non-equi-joins/ – Waldi Feb 25 '21 at 17:48
- You might consider the `fuzzyjoin` package. – Jon Spring Feb 25 '21 at 19:15
- Inner join on means cross join where. Cross join means inner join on true. – philipxy Mar 01 '21 at 02:54
3 Answers
Non-equi joins are available since dplyr 1.1.0 via the function `join_by()`. To create non-equi joins, you can use `<`, `>`, `>=`, `<=`, or the helpers `between()`, `within()`, `overlaps()`, and `nearest()`.
library(dplyr)
# Example from https://github.com/tidyverse/dplyr/pull/5910
set.seed(123)
dates <- as.Date("2019-01-01") + 0:4
needles <- tibble(dates = dates, x = sample(length(dates)))
set.seed(123)
lower <- as.Date("2019-01-01") + sample(6, 5, replace = TRUE)
upper <- lower + sample(2, 5, replace = TRUE)
haystack <- tibble(lower = lower, upper = upper, y = sample(length(lower)))
needles
#> # A tibble: 5 x 2
#> dates x
#> <date> <int>
#> 1 2019-01-01 3
#> 2 2019-01-02 2
#> 3 2019-01-03 5
#> 4 2019-01-04 4
#> 5 2019-01-05 1
haystack
#> # A tibble: 5 x 3
#> lower upper y
#> <date> <date> <int>
#> 1 2019-01-04 2019-01-06 1
#> 2 2019-01-07 2019-01-08 2
#> 3 2019-01-04 2019-01-05 3
#> 4 2019-01-03 2019-01-05 4
#> 5 2019-01-03 2019-01-05 5
# Non-equi join
# For each row in `needles`, find locations in `haystack` matching the condition
left_join(needles, haystack, by = join_by(dates >= lower, dates <= upper))
#> # A tibble: 12 x 5
#> dates x lower upper y
#> <date> <int> <date> <date> <int>
#> 1 2019-01-01 3 NA NA NA
#> 2 2019-01-02 2 NA NA NA
#> 3 2019-01-03 5 2019-01-03 2019-01-05 4
#> 4 2019-01-03 5 2019-01-03 2019-01-05 5
#> 5 2019-01-04 4 2019-01-04 2019-01-06 1
#> 6 2019-01-04 4 2019-01-04 2019-01-05 3
#> 7 2019-01-04 4 2019-01-03 2019-01-05 4
#> 8 2019-01-04 4 2019-01-03 2019-01-05 5
#> 9 2019-01-05 1 2019-01-04 2019-01-06 1
#> 10 2019-01-05 1 2019-01-04 2019-01-05 3
#> 11 2019-01-05 1 2019-01-03 2019-01-05 4
#> 12 2019-01-05 1 2019-01-03 2019-01-05 5
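The pair of inequalities above can also be expressed with the `between()` helper, which is shorthand for `dates >= lower, dates <= upper` (again assuming dplyr >= 1.1.0):

```r
library(dplyr)

# Rebuild the same example data as above
set.seed(123)
dates <- as.Date("2019-01-01") + 0:4
needles <- tibble(dates = dates, x = sample(length(dates)))
set.seed(123)
lower <- as.Date("2019-01-01") + sample(6, 5, replace = TRUE)
upper <- lower + sample(2, 5, replace = TRUE)
haystack <- tibble(lower = lower, upper = upper, y = sample(length(lower)))

# between(x, y_lower, y_upper) is shorthand for x >= y_lower, x <= y_upper,
# so this returns the same 12 rows as the join above
left_join(needles, haystack, by = join_by(between(dates, lower, upper)))
```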

Maël
There has been a new option for this in dbplyr since version 1.4.0: `sql_on`. Citing Kirill Müller:

There is #2240 for dplyr, but it's going to take a while. For databases we already have a workaround [i.e. generic SQL joins].
library(dplyr)
library(dbplyr)
tbl1 <- memdb_frame(a = 1:3, b = 4:2)
tbl2 <- memdb_frame(c = 1:3, b = 2:0)
left_join(tbl1, tbl2, sql_on = "LHS.b < RHS.c")
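The SQL that dbplyr generates for the custom join condition can be inspected with `show_query()` (a standard dbplyr function), which is a quick way to verify that the condition lands in the `ON` clause:

```r
library(dplyr)
library(dbplyr)

tbl1 <- memdb_frame(a = 1:3, b = 4:2)
tbl2 <- memdb_frame(c = 1:3, b = 2:0)

# show_query() prints the translated SQL instead of executing the join
left_join(tbl1, tbl2, sql_on = "LHS.b < RHS.c") %>% show_query()
```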

Jarosław Nirski
For dbplyr: while SQL supports non-equi joins, I have not found an equivalent dplyr approach. My usual workaround, very similar to the r-bloggers link posted by @Waldi, is to join on the equality conditions and then filter on the inequality conditions.
For example:
output <- inner_join(df1, df2, by = c("df1_id" = "df2_id")) %>%
  filter(df1_date <= df2_date)
This translates to SQL similar to:
SELECT *
FROM df1
JOIN df2
ON df1_id = df2_id
WHERE df1_date <= df2_date
Which is not that different from:
SELECT *
FROM df1
JOIN df2
ON df1_id = df2_id
AND df1_date <= df2_date

Simon.S.A.
- "Which is not that different from:" Thanks for the answer. Are there cases when the two differ? – dimid Feb 23 '22 at 09:50
- There may be some cases when using a left, right, or outer join that the two return different output. Suppose we used a left join and there was a record where `df1_date` was `NULL` - in an inequality join this record would be part of the output with nothing joined to it, but using the WHERE clause this record might be dropped. However, in these kinds of cases you probably want to handle `NULL` values explicitly. – Simon.S.A. Feb 23 '22 at 19:54