4

I want to use a character vector containing column names in by in data.table along with the interactive way of defining groups. The vector contains columns which are common across a few data.tables, but each data.table has a few unique columns. Is that possible? Example below.

library(data.table)
mtcarsdt <- data.table(mtcars)
bycols <- c('cyl', 'gear')   # Defined for use across multiple data.tables
mtcarsdt[
  , .(mpg = mean(mpg)),      # This does not work.
  by = c('carb%%2', bycols)  # How can I make this work?
]
mtcarsdt[
  , .(mpg = mean(mpg)), 
  by = .(carb%%2, cyl, gear) # This works
]
Ameya
  • 1,712
  • 1
  • 14
  • 29
  • `%%2` is an arbitrary example of an operation on a column, which is possible in `by` as long as you pass the quoted arguments in a list. For this example, `%%2` could be replaced by `>2` just as easily. – Ameya Mar 01 '19 at 01:02
  • @42- Why are the results incorrect? `mtcarsdt[cyl == 6 & gear == 4 & carb%%2 == 0, mean(mpg)]` yielded the same result (`19.75`) as the first row of the second query for me – Ameya Mar 01 '19 at 01:05
  • I didn't understand what you were doing. I've now figured it out. – IRTFM Mar 01 '19 at 01:11
  • an ugly approach: `mtcarsdt[, .(mpg = mean(mpg)), by=eval(parse(text=paste0("list(carb%%2,", paste(bycols, collapse=","), ")"))) ]` – chinsoon12 Mar 01 '19 at 03:29
  • Possible duplicate of [How can one work fully generically in data.table in R with column names in variables](https://stackoverflow.com/questions/24833247/how-can-one-work-fully-generically-in-data-table-in-r-with-column-names-in-varia) – jangorecki Mar 01 '19 at 04:54

3 Answers3

1

You could make a 3 way interaction vector as the by argument:

mtcarsdt[
    , .(mpg = mean(mpg)),      # This does not work.
    by =  interaction(mtcars$carb%%2, interaction( mtcars[, bycols]))  # How can I make this work?
    ]

    interaction      mpg
 1:       0.6.4 19.75000
 2:       1.4.4 29.10000
 3:       1.6.3 19.75000
 4:       0.8.3 14.63333
 5:       0.4.4 24.75000
 6:       1.8.3 16.30000
 7:       1.4.3 21.50000
 8:       0.4.5 28.20000
 9:       0.8.5 15.40000
10:       0.6.5 19.70000
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Thank you. Did not know about `interaction`. But is there any easy way to retrieve the original labels of the variables? (Or is that a separate question?) – Ameya Mar 01 '19 at 01:26
  • (Dunno. Maybe?) The trick was to find a mechanism for each of the columns to be referred to using the same datatype. Your problem was to combine access using expressions using R symbols for columns with access via character vector reference to column names. There are so many different ways of formulating a `by` reference in data.table but I think there wasn't an effort to allow mixing of these reference-types. – IRTFM Mar 01 '19 at 01:31
  • Thanks! I'll wait for for a day and accept your answer if there are no better answers. – Ameya Mar 01 '19 at 01:35
1

Here's one very literal way of doing it:

mtcarsdt[, .(mpg = mean(mpg)), by = eval(as.call(parse(text = c(".", bycols, "carb %% 2"))))]
#    cyl gear carb      mpg
# 1:   6    4    0 19.75000
# 2:   4    4    1 29.10000
# 3:   6    3    1 19.75000
# 4:   8    3    0 14.63333
# 5:   4    4    0 24.75000
# 6:   8    3    1 16.30000
# 7:   4    3    1 21.50000
# 8:   4    5    0 28.20000
# 9:   8    5    0 15.40000
#10:   6    5    0 19.70000

Another option is constructing the whole expression and eval/parsing it:

bycols = "cyl, gear"
eval(parse(text = paste0('mtcarsdt[, .(mpg = mean(mpg)), by = .(carb %% 2, ', bycols, ')]')))

You can also play the same tricks using eval/quote.

And if you don't care about keeping bycols columns as columns, and mostly care about the grouping, you can also do:

mtcarsdt[, byvals := paste(.BY, collapse = ","), by = bycols][
         , .(mpg = mean(mpg)), by = .(byvals, carb %% 2)]
#    byvals carb      mpg
# 1:    6,4    0 19.75000
# 2:    4,4    1 29.10000
# 3:    6,3    1 19.75000
# 4:    8,3    0 14.63333
# 5:    4,4    0 24.75000
# 6:    8,3    1 16.30000
# 7:    4,3    1 21.50000
# 8:    4,5    0 28.20000
# 9:    8,5    0 15.40000
#10:    6,5    0 19.70000
eddi
  • 49,088
  • 6
  • 104
  • 155
0

this seems like an issue of splicing and evaluating bycols in a given environment. I am not too familiar with the data.table package. But since there are other answers I figured I can give an alternative process that does what you require. The trick is to use the rlang !!! operator with syms. What this does is splices the bycols vector and evaluates it. Easily done with dplyr grouping and summarising.

library(dplyr)
library(rlang)
bycols <- c("cyl", "gear")
mtcarsdt %>% mutate(carb2 = carb%%2) %>% 
  group_by(carb2, !!! syms(bycols)) %>% 
  summarise(m_mpg = mean(mpg))

Now bycols can be whatever columns you like.

Croote
  • 1,382
  • 1
  • 7
  • 15