3

How do you do non standard joins ('col1' != 'col2') in dplyr when you are working with a database.

Example:

Setting up the database:

library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

tableA <- data.frame(col1= c("a","b","c","d"),
                     col2 = c(1,2,3,4))

copy_to(con, tableA)

Here is the sql join I would like to do using dplyr code when working with the database:

SQL code:

tbl(con, sql("select a.col1, b.col2
              from 
              tableA as a
              inner join 
              tableA as b
              on a.col1 <> b.col1")) %>% 
 arrange(col1, col2)

Results:

# Source:     SQL [?? x 2]
# Database:   sqlite 3.19.3 [:memory:]
# Ordered by: col1, col2
     col1  col2
    <chr> <dbl
1     a     2
2     a     3
3     a     4
4     b     1
5     b     3
6     b     4
7     c     1
8     c     2
9     c     4
10     d     1
# ... with more rows

dplyr code:

Here is the initial dplyr code I have attempted when recreating the sql code above:

tbl(con,"tableA")->dbtableA

dbtableA %>% 
  inner_join(dbtableA, by = c('col1' != 'col1')) %>% 
  select(col1, col2=col2.x) %>% 
  arrange(col1, col2)

Error: by must be a (named) character vector, list, or NULL for natural joins (not recommended in production code), not logical

When attempting to solve this using tidyr package (code below) I get an error message:

library(tidyr)

dbtableA %>% 
  expand(col1,col2) %>% 
  left_join(dbtableA, by = 'col1') %>%
  filter(col2.x != col2.y) %>% 
  select(col1, col2 = col2.x) %>% 
  arrange(col1, col2)

Error: in UseMethod("expand_") : no applicable method for 'expand_' applied to an object of class "c('tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"

Does anyone know how to code up this join in dplyr code when using a database? Many thanks in advance.

Dyfan Jones
  • 229
  • 2
  • 9

1 Answers1

4

I think you don't understand the by argument correctly.

In by = c("col1" = "col2"), = is not and equality operator, but an assignment operator (the equality operator in R is ==). The expression inside c(...) creates a named character vector (name: col1 value: col2) that dplyr uses for the join. Nowhere do you define the kind of comparison that is made during the join, the comparison is hard-coded in dplyr. I don't think dplyr supports non-equi joins (yet).

In by = c("col1" != "col2"), != is the inequality operator. So your statement is identical to writing by = TRUE (which is nonsensical).

Also take a look at this question for more discussion on the topic of non-equi joins in dplyr.

Stefan F
  • 2,573
  • 1
  • 17
  • 19