
The problem is well known: unlike with data.frames, where one can refer to columns by character variables, data.table by default expects bare column names (e.g. you cannot do DT[, "X"]; you must do DT[, X] if your table has a column named "X").
In some cases this is a problem, because one wants to handle a generic dataset with arbitrary, user-defined column names.

I saw a couple of posts about this:

Pass column name in data.table using variable

Select / assign to data.table when variable names are stored in a character vector

And the official FAQ says I should use with = FALSE:

https://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.html#i-assigned-a-variable-mycol-x-but-then-dt-mycol-returns-x.-how-do-i-get-it-to-look-up-the-column-name-contained-in-the-mycol-variable
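For context, a minimal sketch of what with = FALSE actually does (toy data assumed, not the dataset from the question): it turns j into data.frame-like column *selection*, so character vectors work for picking columns, but not for calling functions on them.

```r
library(data.table)

# toy table with the same column names used in the question
DT <- data.table(R = 1:4, C = rep(1:2, 2), V = runif(4))
value.var <- "V"

DT[, value.var, with = FALSE]    # selects column "V" as a one-column data.table
DT[, c("R", "C"), with = FALSE]  # selects several columns by character name

# DT[, mean(value.var), with = FALSE]  # does NOT work: with = FALSE makes j a
#                                      # column selector, not a general expression
```

This is why the with = FALSE attempt below fails: mean(value.var) is an expression, not a column selection.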

I do not really understand the quote + eval method, and the one with the .. prefix gave an error before I even got started.
So I only compared the method using the actual column names (which I cannot use in practice), the one using get, and the one using with = FALSE.
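For reference, a hedged sketch of the two approaches set aside above (toy data assumed). The quote + eval idea builds the j expression as an unevaluated call and evaluates it inside the data.table; the .. prefix only works for selecting columns, not inside expressions.

```r
library(data.table)

DT <- data.table(R = 1:4, C = rep(1:2, 2), V = runif(4))
row.var <- "R"; col.var <- "C"; value.var <- "V"

# build the call mean(V) from the character variable, then evaluate it in j;
# by = also accepts a plain character vector of column names
e <- parse(text = paste0("mean(", value.var, ")"))[[1]]
DT[, eval(e), by = c(row.var, col.var)]

# the .. prefix: fine for selection, but not usable inside a call
DT[, ..value.var]          # selects column "V"
# DT[, mean(..value.var)]  # error: '..' is only allowed as a j prefix
```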

Interestingly, the latter, i.e. the officially recommended one, is the only one that does not work at all.
And get, while it works, is for some reason far slower than using the actual column names, which I really don't get (no pun intended).
So I guess I am doing something wrong...

Incidentally, but importantly, I turned to data.table because I needed to make a grouped mean of a fairly large dataset, and my previous attempts using aggregate, by or tapply were either too slow, or too memory-hungry, and they crashed R.

I cannot disclose the actual data I am working with, so I made a simulated dataset of the same size here:

require(data.table)

row.var = "R"
col.var = "C"
value.var = "V"

set.seed(934293)

d <- setNames(data.frame(sample(1:758145, 7582953, replace = TRUE),
                         sample(1:450, 7582953, replace = TRUE),
                         runif(7582953, 5, 9)),
              c(row.var, col.var, value.var))

DT <- as.data.table(d)

rm(m)
print(system.time({
  m <- DT[, mean(V), by = .(R, C)]
}))

#   user  system elapsed 
#   1.64    0.27    0.51 

rm(m)
print(system.time({
  m <- DT[, mean(get(value.var)), by = .(get(row.var), get(col.var))]
}))

#   user  system elapsed 
#  16.05    0.02   14.97 

rm(m)
print(system.time({
  m <- DT[, mean(value.var), by = .(row.var, col.var), with = FALSE]
}))

# Error in h(simpleError(msg, call)) : 
#   error in evaluating the argument 'x' in selecting a method for function
#   'print': missing value where TRUE/FALSE needed
# In addition: Warning message:
# In mean.default(value.var) :
# Timing stopped at: 0 0 0

Any ideas?

user6376297

2 Answers

collap() from the collapse package gives better timing:

library(collapse)
> system.time(collap(DT, reformulate(c(row.var, col.var),
       response = value.var), fmean))
   user  system elapsed 
  0.881   0.020   0.901 

> system.time(fmean(fgroup_by(DT, c(row.var, col.var))))
   user  system elapsed 
  0.931   0.021   0.952 
> system.time(DT[, mean(V), by = .(R, C)])
   user  system elapsed 
  5.052   0.099   0.646 

As the get approach (and the one with .SDcols) takes time, another option is to interpolate the values into a string and evaluate it:

system.time(eval(parse(text = glue::glue("DT[, mean({value.var}), by = .({row.var}, {col.var})]"))))
   user  system elapsed 
  5.065   0.105   0.660 

Checking the output:

> out_c <- collap(DT, reformulate(c(row.var, col.var), 
       response = value.var), fmean)
> out_d <- DT[, mean(V), by = .(R, C)]
> out_dte <- eval(parse(text = glue::glue("DT[, mean({value.var}), by = .({row.var}, {col.var})]")))
> out_c
              R     C        V
          <int> <int>    <num>
      1:      1    16 5.237421
      2:      1    77 5.080965
      3:      1    95 5.822834
      4:      1   107 7.276902
      5:      1   224 5.565160
     ---                      
7499419: 758145   162 8.077307
7499420: 758145   181 6.094559
7499421: 758145   251 7.816277
7499422: 758145   310 8.373066
7499423: 758145   435 8.222885
> out_d[order(R, C)]
              R     C       V1
          <int> <int>    <num>
      1:      1    16 5.237421
      2:      1    77 5.080965
      3:      1    95 5.822834
      4:      1   107 7.276902
      5:      1   224 5.565160
     ---                      
7499419: 758145   162 8.077307
7499420: 758145   181 6.094559
7499421: 758145   251 7.816277
7499422: 758145   310 8.373066
7499423: 758145   435 8.222885

> out_dte[order(R, C)]
              R     C       V1
          <int> <int>    <num>
      1:      1    16 5.237421
      2:      1    77 5.080965
      3:      1    95 5.822834
      4:      1   107 7.276902
      5:      1   224 5.565160
     ---                      
7499419: 758145   162 8.077307
7499420: 758145   181 6.094559
7499421: 758145   251 7.816277
7499422: 758145   310 8.373066
7499423: 758145   435 8.222885

Once we have the output, the column names can be updated with setnames:

> setnames(out_dte[order(R, C)], 'V1', value.var)[]
              R     C        V
          <int> <int>    <num>
      1:      1    16 5.237421
      2:      1    77 5.080965
      3:      1    95 5.822834
      4:      1   107 7.276902
      5:      1   224 5.565160
     ---                      
7499419: 758145   162 8.077307
7499420: 758145   181 6.094559
7499421: 758145   251 7.816277
7499422: 758145   310 8.373066
7499423: 758145   435 8.222885
akrun
  • Thanks akrun, this is very interesting. Yet another package to install and learn to use though. I am OK with data.table's performance for my set. only I do not get why I cannot use assigned variable names. The easiest workaround is of course to rename the columns and write a hard-coded expression; that I can do even in data.table. But it seems pretty absurd to me that one should not be able to just use variables for column names without multiplying by 10 the run time (and I don't even see why that should happen in fact). – user6376297 May 03 '22 at 06:47
  • @user6376297 I did mention using `.SDcols` and subsetting with `.SD`, but that seems to be a bit slower compared to the original call. Regarding renaming inside the `.(` on the lhs of `=`, it is not evaluated, i.e. the reason we use either `setnames` or `setNames`. With `dplyr`, you can rename with `lst(!! yourobj :=`, but that is for list elements – akrun May 03 '22 at 16:15
  • OK, thanks akrun. I resolved the problem by renaming the relevant columns forth and back, using just base R commands. I am still amazed at how a package (`data.table`) can be so good, and make such a positive impact on performance, but then stumble so spectacularly on the simple programmatic need to use generic names for columns, which is totally standard for data.frame's (not sure about SQL, but I imagine there could be a way around there). – user6376297 May 04 '22 at 06:22
This particular problem of programming data.tables can be solved without get() in at least two different ways:

1. Using .SDcols and passing character values to by

Here, .SDcols takes a character vector of column names to operate on, and by = also accepts a character vector of column names.

DT[, lapply(.SD, mean), .SDcols = value.var, by = c(row.var, col.var)]
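A small sketch on toy data (assumed, not the question's dataset) showing a side benefit of this variant: the result keeps the original value-column name, so no setnames() step is needed afterwards.

```r
library(data.table)

DT2 <- data.table(R = c(1, 1, 2), C = c(1, 2, 1), V = c(10, 20, 30))
row.var <- "R"; col.var <- "C"; value.var <- "V"

# grouped mean driven entirely by character variables
DT2[, lapply(.SD, mean), .SDcols = value.var, by = c(row.var, col.var)]
# the value column in the result is named "V" (not "V1"),
# because lapply(.SD, ...) preserves the .SD column names
```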

2. Using the new env parameter

With development version 1.14.3, data.table has gained a new interface for programming on data.table (see item 10 in the Changelog).

data.table::update.dev.pkg() # Install latest dev version (1.14.3)
library(data.table)

DT[, mean(v1), by = .(v2, v3), env = list(v1 = value.var, v2 = row.var, v3 = col.var)]

Benchmarks

microbenchmark::microbenchmark(
  nat = DT[, mean(V), by = .(R, C)],
  # get = DT[, mean(get(value.var)), by = .(get(row.var), get(col.var))],
  chr = DT[, lapply(.SD, mean), .SDcols = value.var, by = c(row.var, col.var)],
  env = DT[, mean(v1), by = .(v2, v3), env = list( v1 = value.var, v2 = row.var, v3 = col.var)],
  times =  3L,
  check = "equivalent"
)
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
  nat 1.275479 1.313737 1.356826 1.351995 1.397500 1.443005     3
  chr 1.279219 1.297576 1.328261 1.315933 1.352782 1.389630     3
  env 1.324985 1.327743 1.331603 1.330502 1.334912 1.339323     3

All three variants took approximately the same time; get was excluded from this run because its run time was too long.

Edit: Benchmarks results including get:

Unit: seconds
 expr        min         lq       mean     median         uq        max neval
  nat   1.238719   1.288629   1.315594   1.338539   1.354032   1.369525     3
  get 569.560843 572.417951 576.482720 575.275059 579.943658 584.612257     3
  chr   1.275734   1.279536   1.309346   1.283338   1.326153   1.368967     3
  env   1.298941   1.316105   1.326649   1.333268   1.340503   1.347738     3

So, get takes about 500 times longer than the other three variants.

An explanation for this behaviour is given by data.table when the verbose option is switched on:

options(datatable.verbose = TRUE)

Now,

DT[, mean(get(value.var)), by = .(get(row.var), get(col.var))]

gives the following hints:

Argument 'by' after substitute: .(get(row.var), get(col.var))
'(m)get' found in j. ansvars being set to all columns. Use .SDcols or a single j=eval(macro) instead. Both will detect the columns used which is important for efficiency.
Old ansvars: [R, C, V]
New ansvars: [R, C, V]
Finding groups using forderv ... forder.c received 7582953 rows and 2 columns 0.600s elapsed (0.720s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.070s elapsed (0.060s cpu)
Getting back original order ... forder.c received a vector type 'integer' length 7499423 0.470s elapsed (0.640s cpu)
lapply optimization is on, j unchanged as 'mean(get(value.var))'
GForce is on, left j unchanged
Old mean optimization changed j from 'mean(get(value.var))' to '.External(Cfastmean, get(value.var), FALSE)'
Making each group and running j (GForce FALSE) ...
[...]

The three other variants all use data.table's GForce optimization.

Uwe