2

If I run the following trivial example, I get the expected output:

library(dplyr)
library(dtplyr)
library(data.table)

dt1 <- lazy_dt(data.table(a = 1:5, b = 6:10))
dt2 <- lazy_dt(data.table(a = letters[1:5], b = 6:10))

dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  as.data.table()
>     b a.x a.y
> 1:  6   1   a
> 2:  7   2   b
> 3:  8   3   c
> 4:  9   4   d
> 5: 10   5   e

Note that the conflicting columns a are properly managed, using the standard dplyr format of adding .x and .y suffixes.

However, if I now try to drop one of the columns:

dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  select(
    -a.y
  ) %>%
  as.data.table()
> Error in is_character(x) : object 'a.y' not found

Interestingly, if I try to select one of the a columns (select(a.x)), I get the same error, but... if I instead try select(a) (selecting a column which shouldn't really exist anymore), I get the following output:

dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  select(
    a
  ) %>%
  as.data.table()
>    a.b
> 1:   1
> 2:   2
> 3:   3
> 4:   4
> 5:   5

where the selected column is clearly dt1$a, but for some reason the given column name is a.b. (if I try select(a.b), I get the same object not found error).

Meanwhile, if I try to drop a, both a columns are dropped:

dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  select(
    -a
  ) %>%
  as.data.table()
>     b
> 1:  6
> 2:  7
> 3:  8
> 4:  9
> 5: 10

So, how can I use select with joins where the tables have conflicting (and not joined-by) columns?

EDIT:

As mentioned in some answers, I can obviously execute the lazy evaluation before the select, which works. However, it throws a warning (since I'd like to keep my object as a data.table, not a data.frame) so it doesn't seem to be the intended method:

dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  as.data.table() %>%
  select(
    -a.x
  )
>     b a.y
> 1:  6   a
> 2:  7   b
> 3:  8   c
> 4:  9   d
> 5: 10   e
> Warning message:
> You are using a dplyr method on a raw data.table, which will call the data 
> frame implementation, and is likely to be inefficient.
> * 
> * To suppress this message, either generate a data.table translation with
> * `lazy_dt()` or convert to a data frame or tibble with
> * `as.data.frame()`/`as_tibble()`.
Wasabi
  • 2,879
  • 3
  • 26
  • 48

3 Answers3

3

The key piece is that dtplyr uses lazy evaluation. More here https://dtplyr.tidyverse.org/, but the key piece is:

Compared to the previous release, this version of dtplyr is a complete rewrite that focusses only on lazy evaluation triggered by use of lazy_dt(). This means that no computation is performed until you explicitly request it with as.data.table(), as.data.frame() or as_tibble().

In your example, this means that the join hasn't been evaluated before the select, they are being translated and waiting to be evaluated. The translation mechanism allows dtplyr to combine multiple verbs more efficiently into one action. (More on translation here: https://dtplyr.tidyverse.org/articles/translation.html)

There are a few ways to approach the problem. The simplest is to move up the evaluation using as.data.frame().

dt1 %>%
  left_join(
    dt2,
    by = "b"
  ) %>%
  as.data.frame() %>%
  select(-a.y)

Another way is to get the data.table object, and then subset the columns using data.table syntax.

ravic_
  • 1,731
  • 9
  • 13
  • Yes, I'm currently doing precisely that (though it now throws a warning). I'm also aware that `dtplyr` uses lazy evaluation (it's right there on the name `lazy_dt` :p ). But since `lazy_dt` simply accumulates all the calls until `as.data.table()`, I expected it to accumulate that `select` request as well and hope such columns exist once the object is actually evaluated. – Wasabi Nov 19 '19 at 16:53
  • Apologies, didn't understand your question's intent. There is more info provided in the translations vignette (https://dtplyr.tidyverse.org/articles/translation.html) that may inform how those verbs are eventually translated. Perhaps that will be of help? – ravic_ Nov 19 '19 at 16:59
  • 1
    Updated my answer with a bit about translation, but hopefully someone from the community can put in a better answer. It's a worthy question. – ravic_ Nov 19 '19 at 17:23
1

Apparently, left_join does not work with data.table but works with data.frame (I didn't know it before).

Therefore, one solution could be to do:

library(dplyr)
library(dtplyr)
library(data.table)

dt1 <- lazy_dt(data.table(a = 1:5, b = 6:10))
dt2 <- lazy_dt(data.table(a = letters[1:5], b = 6:10))

as.data.frame(dt1) %>%
  left_join(as.data.frame(dt2), by = "b") %>%
  select(-a.y) %>%
  as.data.table()
bretauv
  • 7,756
  • 2
  • 20
  • 57
0

This is a bug in the current release of dtplyr (1.0.0), but has now been fixed in the development version.

Wasabi
  • 2,879
  • 3
  • 26
  • 48