
Given the data.table dat:

dat <- data.table(x_one=1:10, x_two=1:10, y_one=1:10, y_two=1:10) 

I'd like a function that creates an expression between two like columns given their "root" name, e.g. x_one - x_two.

myfun <- function(name) {
  one <- paste0(name, '_one')
  two <- paste0(name, '_two')

  parse(text=paste(one, '-', two))
}

Now, using just one root name works as expected and results in a vector.

dat[, eval(myfun('x')),]

[1] 0 0 0 0 0 0 0 0 0 0

However, trying to assign that output a name using the list technique fails:

dat[, list(x_out = eval(myfun('x'))),]

Error in eval(expr, envir, enclos) : object 'x_one' not found

I can "solve" this by adding a with(dat, ...) but that hardly seems data.table-ish

dat[, list(x_out = with(dat, eval(myfun('x'))),
           y_out = with(dat, eval(myfun('y')))),]

    x_out y_out
 1:     0     0
 2:     0     0
 3:     0     0
 4:     0     0
 5:     0     0
 6:     0     0
 7:     0     0
 8:     0     0
 9:     0     0
10:     0     0

What is the proper way to generate and evaluate these expressions if I want an output like I have above?

In case it helps, sessionInfo() output is below. I recall being able to do this, or something close to it, but it's been a while and data.table has been updated since...

R version 2.15.1 (2012-06-22)

Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] graphics  grDevices utils     datasets  stats     grid      methods   base     

other attached packages:
 [1] Cairo_1.5-1      zoo_1.7-7        stringr_0.6.1    doMC_1.2.5       multicore_0.1-7  iterators_1.0.6  foreach_1.4.0   
 [8] data.table_1.8.2 circular_0.4-3   boot_1.3-5       ggplot2_0.9.1    reshape2_1.2.1   plyr_1.7.1      

loaded via a namespace (and not attached):
 [1] codetools_0.2-8    colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2       labeling_0.1       lattice_0.20-6    
 [7] MASS_7.3-20        memoise_0.1        munsell_0.3        proto_0.3-9.2      RColorBrewer_1.0-5 scales_0.2.1      
[13] tools_2.15.1      
Justin

2 Answers


One solution is to put the list(...) call inside the expression the function returns.

I tend to use as.quoted, stealing from the way @hadley implements .() in the plyr package.

library(data.table)
library(plyr)
dat <- data.table(x_one=1:10, x_two=1:10, y_one=1:10, y_two=1:10) 
myfun <- function(name) {
  one <- paste0(name, '_one')
  two <- paste0(name, '_two')
  out <- paste0(name, '_out')
  as.quoted(paste('list(', out, '=', one, '-', two, ')'))[[1]]
}


dat[, eval(myfun('x')),]

#    x_out
# 1:     0
# 2:     0
# 3:     0
# 4:     0
# 5:     0
# 6:     0
# 7:     0
# 8:     0
# 9:     0
#10:     0
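For comparison, here is a minimal sketch of the same idea without the plyr dependency (the name myfun_base is my own, not from the answer); parse() returns an expression and [[1]] extracts the call, just as as.quoted(...)[[1]] does:

myfun_base <- function(name) {
  one <- paste0(name, '_one')
  two <- paste0(name, '_two')
  out <- paste0(name, '_out')
  # build 'list(x_out = x_one - x_two)' and return the unevaluated call
  parse(text = paste0('list(', out, ' = ', one, ' - ', two, ')'))[[1]]
}

dat[, eval(myfun_base('x'))]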

To do two columns at once, you can adjust the call:

myfun <- function(name) {
  one <- paste0(name, '_one')
  two <- paste0(name, '_two')
  out <- paste0(name, '_out')
  calls <- paste(paste(out, '=', one, '-', two), collapse = ',')
  as.quoted(paste('list(', calls, ')'))[[1]]
}


dat[, eval(myfun(c('x','y'))),]

#   x_out y_out
# 1:     0     0
# 2:     0     0
# 3:     0     0
# 4:     0     0
# 5:     0     0
# 6:     0     0
# 7:     0     0
# 8:     0     0
# 9:     0     0
#10:     0     0
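For illustration, printing the unevaluated call the function builds shows the j expression being constructed (output shown as I'd expect it to deparse):

myfun(c('x','y'))
# list(x_out = x_one - x_two, y_out = y_one - y_two)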

As for the reason...

In this solution, data.table sees that j starts with a call to eval, evaluates myfun(...) once in the parent frame to obtain the list(...) call, and then evaluates that entire call within the data.table.

The relevant code within [.data.table is

if (missing(j)) stop("logical error, j missing")
jsub = substitute(j)
if (is.null(jsub)) return(NULL)
jsubl = as.list.default(jsub)
if (identical(jsubl[[1L]],quote(eval))) {
    jsub = eval(jsubl[[2L]],parent.frame())
    if (is.expression(jsub)) jsub = jsub[[1L]]
}

If, in your case,

j = list(xout = eval(myfun('x')))

then

jsub <- substitute(j)

is

#  list(xout = eval(myfun("x")))

and

as.list.default(jsub)
## [[1]]
## list
## 
## $xout
## eval(myfun("x"))

so jsubl[[1L]] is list and jsubl[[2L]] is eval(myfun("x")).

Therefore data.table has not found a call to eval at the top level of j and will not deal with it appropriately. (See the quick check below.)
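Here is a quick check of that point, mimicking the inspection [.data.table performs (my own illustration):

jsub <- quote(list(xout = eval(myfun("x"))))
identical(as.list.default(jsub)[[1L]], quote(eval))
# [1] FALSE   -- the head of the call is `list`, so the eval branch is never taken

jsub <- quote(eval(myfun("x")))
identical(as.list.default(jsub)[[1L]], quote(eval))
# [1] TRUE    -- j that starts with eval is handled specially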

This will work, forcing the second evaluation within the correct environment, the data.table:

# using OP myfun
dat[, list(xout = eval(myfun('x'), dat))]

In the same way,

eval(parse(text = 'x_one'), dat)
# [1]  1  2  3  4  5  6  7  8  9 10

works, but

eval(eval(parse(text = 'x_one')), dat)

does not, because the inner eval runs first, in the calling frame, where x_one does not exist.

Edit 10/4/13

It is probably safer (but slower) to use .SD as the environment, though, as it will then be robust to i or by as well, e.g.

dat[, list(xout = eval(myfun('x'), .SD))]
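As a hedged illustration of that robustness (the copy dat2 and the grouping column g are my own invention, not part of the question):

dat2 <- copy(dat)[, g := rep(1:2, each = 5)]
# .SD is the per-group subset of dat2, so eval finds x_one and x_two in each group
dat2[, list(xout = eval(myfun('x'), .SD)), by = g]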

Edit from Matthew:

+10 to above. I couldn't have explained it better myself. Taking it a step further, what I sometimes do is construct the entire data.table query and then eval that. It can be a bit more robust that way, sometimes. I think of it like SQL; i.e., we often construct a dynamic SQL statement that is sent to the SQL server to be executed. When you are debugging, too, it's also sometimes easier to look at the constructed query and run that at the browser prompt. But sometimes such a query would be very long, so passing eval into i, j or by can be more efficient by not recomputing the other components. As usual, there are many ways to skin the cat.
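A minimal sketch of that construct-the-whole-query approach (the query text below is mine, mirroring the example data, not Matthew's code):

q <- "dat[, list(x_out = x_one - x_two, y_out = y_one - y_two)]"
# while debugging, print q and paste it at the prompt; to run it programmatically:
eval(parse(text = q)[[1]])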

The subtle reasons for considering evaling the entire query include:

  1. One reason grouping is fast is that it inspects the j expression first. If it's a list, it removes the names, but remembers them. It then evals an unnamed list for each group, then reinstates the names once, at the end, on the final result. One reason other methods can be slow is the recreation of the same column name vector for each and every group, over and over again. The more complexly j is defined, though (e.g. if the expression doesn't start precisely with list), the harder it gets to code up the inspection logic internally. There are lots of tests in this area; e.g., in combination with eval, and verbosity reports if name dropping isn't working. But constructing a "simple" query (the full query) and evaling that may be faster and more robust for this reason.

  2. With v1.8.2 there's now optimization of j: options(datatable.optimize=Inf). This inspects j and modifies it to optimize mean and the lapply(.SD,...) idiom, so far (see the sketch after this list). This makes an orders-of-magnitude difference and means there's less for the user to need to know (e.g. a few of the wiki points have gone away now). We could do more of this; e.g., DT[a==10] could be optimized to DT[J(10)] automatically if key(DT)[1]=="a" [Update Sep 2014 - now implemented in v1.9.3]. But again, the internal optimizations get harder to code up internally if, rather than DT[,mean(a),by=b], it's DT[,list(x=eval(expr)),by=b] where expr contains a call to mean, for example. So evaling the entire query may play nicer with datatable.optimize. Turning verbosity on reports what it's doing, and optimization can be turned off if needed; e.g., to test the speed difference it makes.
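As a sketch of point 2 (my own example; options(datatable.optimize=) and the verbose argument are real data.table features, the grouping choice here is arbitrary):

options(datatable.optimize = Inf)
# verbose output reports how j was optimized before evaluation
dat[, lapply(.SD, mean), by = x_one, verbose = TRUE]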

As per comments, FR#2183 has been added: "Change j=list(xout=eval(...))'s eval to eval within scope of DT". Thanks for highlighting. That's the sort of complex j I mean where the eval is nested in the expression. If j starts with eval, though, that's much simpler and already coded (as shown above) and tested, and should be optimized fine.

If there's one take-away from this then it's: do use DT[...,verbose=TRUE] or options(datatable.verbose=TRUE) to check data.table is still working efficiently when used for dynamic queries involving eval.
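For example, either form works (both are standard data.table usage):

dat[, eval(myfun('x')), verbose = TRUE]   # per-call verbosity
options(datatable.verbose = TRUE)         # or globally, for the whole session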

Matt Dowle
mnel
  • Interesting! If anyone has insight into why `data.table` cannot find the column names in my example but can find them in this version I am all ears! – Justin Aug 08 '12 at 23:15
  • In my example the whole call to `list` is evaluated, whereas yours will have two evaluation environments, and perhaps a different environment needed to be called (from within data.table). I was trying to decipher the source code to `[.data.table` but think that Matt Dowle will have to do so [and perhaps update to allow this] – mnel Aug 08 '12 at 23:34
  • Well done! Thank you for the detective work and the explanation. – Justin Aug 09 '12 at 00:13
  • I would think there could be a way to set the environment of the nested eval call to be the same using some `parent.frame` / `sys.frame` magic. But I feel like I'm chasing a white rabbit down the rabbit hole. – mnel Aug 09 '12 at 00:45
  • +10 I couldn't have explained it better myself. Have added [FR#2183](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2183&group_id=240&atid=978) "Change j=list(xout=eval(...))'s eval to eval within scope of DT". However in this case Justin wants the output column name to be flexible as well, iiuc, so your solution of including the `list(...)` in the expression is needed anyway it seems. – Matt Dowle Aug 09 '12 at 09:24
  • @MatthewDowle and mnel, thank you both for the thorough explanation! Someday I vow to understand the intricacies of `data.table` – Justin Aug 09 '12 at 14:03
  • @MattDowle, in case of many columns in `DT`, the `eval` within `.()` is 50x (!) slower than without - far better to just convert to `DT` after calculation. Here is a test: `dt <- data.table::data.table('c1' = 1:5000, 'c2' = runif(5000), t(1:500)); microbenchmark::microbenchmark(tmp <- dt[, eval(parse(text = 'c1*c2'))], tmp <- dt[, .('c1*c2' = eval(parse(text = 'c1*c2')))], tmp <- data.table::data.table('c1*c2' = dt[, eval(parse(text = 'c1*c2'))]), times = 1000)`. I don't think this is normal? – Davor Josipovic Jun 07 '17 at 11:53
  • @DavorJosipovic That test has `times=1000` and you're stating "50x(!)". So you're testing overhead and that "50x" applies to very small timings. Set `times=3` and increase the data size. Then it would be a reasonable benchmark. Unless, in the real world, you do actually need to call it 1000 times? – Matt Dowle Jun 07 '17 at 21:19
  • @MattDowle, the average (mean) execution is about 50x longer. Can you not reproduce it? The test was run 1000 times to get an accurate average. The average relates to executing the code just once, not 1000 times. – Davor Josipovic Jun 07 '17 at 21:35
  • @DavorJosipovic Fine. I ran it. So the means are 0.3 milliseconds -vs- 12 milliseconds. I really can't see how my first comment could be any clearer. In other words ... who cares about 12ms? Do you really care about 12ms? Again: "So you're testing overhead and that "50x" applies to very small timings. Set times=3 and increase the data size. Then it would be a reasonable benchmark. Unless, in the real world, you do actually need to call it 1000 times?" – Matt Dowle Jun 07 '17 at 22:21
  • @DavorJosipovic To answer your question: yes, it's very normal in all computer languages for 1,000 calls to a function to reveal the overheads in calling the function. This only ever matters if you really do need to call such a function a lot, very quickly; e.g. real-time, latency-sensitive applications. Is that what you're using data.table for? – Matt Dowle Jun 07 '17 at 22:31
  • @MattDowle, indeed. I call data.table many, many times. The more columns there are, the slower the `.()`. But yes, depending on the point of view, it matters or it does not. – Davor Josipovic Jun 08 '17 at 13:04

This doesn't feel ideal, but it's the best I've been able to come up with. I'll throw it out there just to see if it helps draw out any better responses...

vars <- c("x", "y")
res <- do.call(data.table, lapply(vars, function(X) dat[, eval(myfun(X))]))
setnames(res, names(res), paste0(vars, "_out"))

## Check the results
head(res, 3)
#    x_out y_out
# 1:     0     0
# 2:     0     0
# 3:     0     0

The part I don't like is that lapply() will create one copy of the output, in list form, and then data.table() will (as far as I understand) have to copy those data to a separate location, which is worse than if you'd used the list() construct within [.data.table().

Josh O'Brien
  • Thanks! Your version looks more like my code did without `data.table`. I can always fall back on writing out all the column names, but then I don't get any style points! – Justin Aug 08 '12 at 22:58