Translating dplyr to data.table

Question

So I am trying to translate some dplyr code. I have tried to get help from a package that translates dplyr to data.table but it still does not work. The error is with row_number from dplyr..

I need all the steps in the dplyr code (even though they don't make sense here with mtcars)

library(dplyr)
library(dtplyr) # from https://github.com/tidyverse/dtplyr
library(data.table)

mtcars %>% 
  distinct(mpg, .keep_all = TRUE) %>% 
  group_by(am) %>% 
  arrange(mpg, .by_group = TRUE) %>% 
  mutate(row_num = LETTERS[row_number()]) %>% 
  ungroup() 

# using dtplyr
dt <- lazy_dt(mtcars)

dt %>% 
  distinct(mpg, .keep_all = TRUE) %>% 
  group_by(am) %>% 
  arrange(mpg, .by_group = TRUE) %>% 
  mutate(row_num = LETTERS[row_number()]) %>% 
  ungroup() %>% 
  show_query()
#> unique(`_DT1`, by = "mpg")[order(am, mpg)][, `:=`(row_num = c("A", 
#> "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", 
#> "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z")[row_number()]), 
#>     keyby = .(am)]

# I then use the query from dtplyr 
DT <- as.data.table(mtcars)
unique(DT, by = "mpg")[order(am, mpg)][, `:=`(row_num = c("A", 
                                                              "B", "C", "D", "E", "F", "G", 
                                                              "H", "I", "J", "K", "L", "M", 
                                                              "N", "O", "P", "Q", "R", "S", 
                                                              "T", "U", "V", "W", "X", "Y", 
                                                              "Z")[row_number()]), keyby = .(am)]

#> row_number() should only be called in a data context

^{Created on 2019-07-14 by the reprex package (v0.3.0)}

score 8 · Answer 1 · answered Jul 14 '19 at 00:19

8

Might I recommend the rowid function? It does the grouping step "under the hood" you might find it looks cleaner:

unique(DT, by='mpg')[order(am, mpg), row_num := LETTERS[rowid(am)]]

if you love chaining, you could also get everything inside []:

DT[ , .SD[1L], by = mpg
   ][order(am, mpg), row_num := LETTERS[rowid(am)]]

answered Jul 14 '19 at 00:19

MichaelChirico

33,841
14
113
198

I think you need to separate `i` and `j` in the first expression otherwise it modifies-in-place the copy produced by `unique()`? (That's certainly what my experiments suggest). And I think `rowid()` is different to `row_number()`? I'm translating to `frank(x, ties.method = "first", na.last = "keep")`. – hadley Jul 15 '19 at 15:59
Oh, you just need a `[]` on the end, or to assign it to a variable? – hadley Jul 15 '19 at 16:08
@hadley there shouldn't be a material difference in separating `i`/`j` in the first case -- yes, `:=` is modifying by reference, but it's modifying an object that's not yet assigned (the result of `unique()`). Deferring to a second `[` would assign by reference to a different object that's not yet assigned (the result of `unique()[]`) – MichaelChirico Jul 15 '19 at 22:34
Re: `row_number()` oh! i didn't realize `row_number()` is potentially non-unique by default. I guess `rowid()` is equivalent to `dense_rank()`? Your use of `frank` sounds correct at a glance. – MichaelChirico Jul 15 '19 at 22:37

hadley · Answer 2 · 2019-07-15T16:06:01.377

7

I'm experimenting with some tweaks to the translation so that dtplyr will automatically produce something more like what you want:

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

dt <- lazy_dt(mtcars)

dt %>% 
  distinct(mpg, .keep_all = TRUE) %>% 
  group_by(am) %>% 
  arrange(mpg, .by_group = TRUE) %>% 
  mutate(row_num = LETTERS[row_number()]) %>% 
  ungroup() %>% 
  show_query()
#> unique(`_DT1`, by = "mpg")[order(am, mpg)][, `:=`(row_num = ..LETTERS[seq_len(.N)]), 
#>    keyby = .(am)]

Or avoiding the grouping as @MichaelChirico suggests:

dt %>% 
  distinct(mpg, .keep_all = TRUE) %>% 
  arrange(am, mpg) %>% 
  mutate(row_num = LETTERS[row_number(am)]) %>% 
  ungroup() %>% 
  show_query()
#> unique(`_DT1`, by = "mpg")[order(am, mpg)][, `:=`(row_num =  ..LETTERS[frank(am, 
#>    ties.method = "first", na.last = "keep")])]

(Using the .. in front of LETTERS is a data.table feature that makes it clear that you're referring to a variable outside of the data frame; it's probably not necessary here but I think it's better to be safe than sorry.)

edited Jul 15 '19 at 16:06

answered Jul 15 '19 at 15:43

hadley

102,019
32
183
245

Thanks. Also thanks for all your books and packages. Is it aslo possible to include translation of ``spread()`` and ``gather()`` at some point also? – xhr489 Jul 15 '19 at 17:03
1

I'm not sure which the OP wants, but there is a behavioral difference between Michael's `DT[order(...), v := stuff]` and your `DT[order(...)][, v := stuff]` -- the latter is a permanent reordering that creates a new table, while the former is just temporary for the assignment within the existing table. I suspect dplyr would need new verbs to cover that case... mutate_on_filter, mutate_on_slice, mutate_on_arrange. (cont..) – Frank Jul 15 '19 at 18:57
1

(..cont) For permanent reordering `x[order(...)]` would idiomatically and more efficiently be `setorder(x, ...)`, though that may make the chaining less readable. (In practice, I think most people would put it on multiple lines rather than chaining, fwiw, so the show_query still wouldn't quite be idiomatic.) Since unique() already creates a new table, there is no need for arrange to also make a copy / create a new table. Edit: I'm kind of surprised `frank` shows up in translation of row_number instead of `rowid` or seq_len(.N). – Frank Jul 15 '19 at 18:58
Obviously dtplyr can only translate existing dplyr verbs, so I think translating `arrange()` to `x[order(...)]` most accurately captures the intent of the original code. I am also surprised that `frank()` is needed, but that's what was revealed by unit tests. Grouped `rowid()` does something that I don't expect (and didn't spend the time to figure out). – hadley Jul 18 '19 at 20:33

score 5 · Accepted Answer · answered Jul 13 '19 at 23:45

5

We can use seq_len(.N)

unique(DT, by = "mpg")[order(am, mpg)][, 
     `:=`(row_num = LETTERS[seq_len(.N)]), by = .(am)][]

answered Jul 13 '19 at 23:45

akrun

874,273
37
540
662

score 1 · Answer 4 · answered Jul 15 '19 at 19:40

As the data.table syntax is heavily critizised below are two versions of akrun answer with IMHO a clearer syntax.

I find much harder to follow data.table code when it's piped with [ multiple times, specially when there is a := call (mutate in dplyr).

library(data.table)
dt = as.data.table(mtcars)

dt = unique(dt, by = "mpg")
dt = dt[order(am, mpg)]
dt[, row_num:=LETTERS[seq_len(.N)], by=am]
dt[1:3]

    mpg cyl disp  hp drat   wt  qsec vs am gear carb row_num
1: 10.4   8  472 205 2.93 5.25 17.98  0  0    3    4       A
2: 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4       B
3: 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4       C

Another option is using the %>% pipe.

library(magrittr)

dt = as.data.table(mtcars)
dt = unique(dt, by = "mpg") %>%
  .[order(am, mpg)] %>%
  .[, row_num:=LETTERS[seq_len(.N)], by=am]
dt[1:3]

#     mpg cyl disp  hp drat   wt  qsec vs am gear carb row_num
# 1: 10.4   8  472 205 2.93 5.25 17.98  0  0    3    4       A
# 2: 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4       B
# 3: 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4       C

Is it necessary to order by 'am' ? Because 'am' is going to be used in the 'by' statement. — xhr489, Jul 15 '19 at 19:46
That's in your code, in the `arrange` call. It changes the output if you don't order. — marbel, Jul 15 '19 at 19:49

Translating dplyr to data.table

4 Answers4