I am stuck on a small R issue with data.table. Your help is much appreciated. How do I make the following work?

getResult <- function(dt, expr, gby) {
  e <- substitute(expr)
  b <- substitute(gby)
  return(dt[,eval(e),by=b])
}

v1 <- "Sepal.Length"
v2 <- "Species"

library(data.table)
dt <- data.table(iris)
rDT <- getResult(dt, sum(v1, na.rm=TRUE), v2)

I get the following error:

Error in sum(v1, na.rm = TRUE) : invalid 'type' (character) of argument

Both v1 and v2 are passed in from another program as character variables, so I can't use v1 <- quote(Sepal.Length), which does seem to work.

user1157129
  • This may put you on the right track: `dt[, sum(get(v1), na.rm=TRUE), by=v2]` or suggest an alternative approach if you are flexible. – flodel May 20 '12 at 17:31
  • Thx. It worked, but what happened? The function gets the object named by v1. What did the substitute function do to this expression? Did it do nothing and just try to replace v1 with the character value "Sepal.Length"? – user1157129 May 20 '12 at 18:26
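
To spell out what the comments above are getting at: substitute() (like quote()) captures the call sum(v1, na.rm = TRUE) unevaluated, so v1 stays a symbol. When the call is finally evaluated there is no column called v1, the lookup finds the character variable, and sum() receives the string "Sepal.Length". get(v1), by contrast, looks up the object whose name is stored in v1. Below is a minimal sketch, assuming the data.table package is installed; getResultGet is a hypothetical name for illustration and does not appear in the thread.

```r
library(data.table)

dt <- data.table(iris)
v1 <- "Sepal.Length"
v2 <- "Species"

# quote()/substitute() keep v1 as a symbol. Evaluating the call hands the
# character string "Sepal.Length" to sum(), which raises the error above.
e <- quote(sum(v1, na.rm = TRUE))
failed <- inherits(try(eval(e), silent = TRUE), "try-error")

# Hypothetical wrapper around flodel's suggestion: get() resolves the column
# from its name inside j, and by takes a character vector of column names.
getResultGet <- function(dt, col, gby) {
  dt[, sum(get(col), na.rm = TRUE), by = gby]
}

rDT <- getResultGet(dt, v1, v2)
```

One caveat with this sketch: if the table itself had a column named col or gby, the lookup inside the function could pick up that column instead of the argument.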

1 Answer

An alternative to flodel's answer in the comments could be

e <- parse(text = paste0("sum(", v1, ", na.rm = TRUE)"))

b <- parse(text = v2)

rDT2 <- dt[, eval(e), by = eval(b)]

#               b    V1
# [1,]     setosa 250.3
# [2,] versicolor 296.8
# [3,]  virginica 329.4

EDIT:

And to put this into a function,

getResult <- function(dt, expr, gby){
  return(dt[, eval(expr), by = eval(gby)])
}

(dtR <- getResult(dt = dt, expr = e, gby = b))
# gives the same result as above
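
As a variation on the parse(text = ...) approach (not what the answer above uses), the same call can be built from the string without pasting code together, by turning the column name into a symbol with as.name() and splicing it in with substitute(). This is only a sketch, assuming data.table is installed; e2 and rDT3 are illustrative names.

```r
library(data.table)

dt <- data.table(iris)
v1 <- "Sepal.Length"
v2 <- "Species"

# Build the call sum(Sepal.Length, na.rm = TRUE) from the string held in v1.
# as.name() converts the string to a symbol; substitute() splices it in for x.
e2 <- substitute(sum(x, na.rm = TRUE), list(x = as.name(v1)))

# Evaluate the constructed call per group; by takes the column name as a string.
rDT3 <- dt[, eval(e2), by = v2]
```

Because no text is parsed, this form avoids quoting problems with unusual column names, at the cost of being a little less readable than the paste0 version.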


EDIT from Matthew: There's a subtle reason why the paste0 and eval/quote methods can be faster than get in some cases, too. One of the reasons grouping can be fast is that data.table inspects j to see which columns it uses, then subsets only those columns (FAQ 1.12 and 3.1). It uses base::all.vars(j) to do that. When get() is used in j, the column being used is hidden from all.vars, and data.table falls back to subsetting all the columns just in case the j expression needs them (much like when the .SD symbol is used in j, which is the problem .SDcols was added to solve).

If all the columns are used anyway then it doesn't make a difference, but if DT is, say, 1e7 x 100, then a grouped j=sum(V1) should be much faster than a grouped j=sum(get("V1")) for that reason. At least, that's what's supposed to happen; if it doesn't, it may be a bug. If, on the other hand, many queries are being constructed dynamically and repeated, then the time to paste0 and parse might come into it. It all depends, really. Setting verbose=TRUE should print a message about which columns have been detected as used by j, so that can be checked.
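
The point about all.vars can be checked in base R alone. The sketch below only illustrates what base::all.vars reports for the two forms of j; it says nothing about data.table's internals beyond what the edit above describes, and j_direct and j_get are illustrative names.

```r
# Two expressions of the kind that could appear in j.
j_direct <- quote(sum(V1))
j_get    <- quote(sum(get("V1")))

# V1 appears as a symbol here, so all.vars() reports it.
vars_direct <- all.vars(j_direct)

# Here "V1" is only a string constant, so all.vars() finds no variables.
vars_get <- all.vars(j_get)

print(vars_direct)
print(vars_get)
```

With get(), nothing in the expression names the column, which is why the column in use cannot be detected from j alone.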

Matt Dowle
BenBarnes
  • Thanks. Going back to the original question, how do I do this with your solution? getResult <- function(dt, expr, gby) { print(dt[,eval(expr), by=eval(b)]) } v1 <- "Sepal.Length" v2 <- "Species" e <- parse(text = paste("sum(", v1, ", na.rm = TRUE)")) b <- parse(text = v2) #rDT2 <- dt[, eval(e), by = eval(b)] dtR <- getResult(dt, e, b) – user1157129 May 20 '12 at 19:08
  • @user1157129, Sorry for the omission of the function as requested in your question. Please see the edit for a suggestion. – BenBarnes May 20 '12 at 20:39
  • Sorry Ben, it is not working, am I goofing up something? getResult <- function(dt, expr, gby) { return(dt[,eval(expr), by=eval(gby)]) } dt <- data.table(iris) v1 <- "Sepal.Length" v2 <- "Species" e <- parse(text = paste("sum(", v1, ", na.rm = TRUE)")) b <- parse(text = v2) dtR <- getResult(dt = dt, expr = e, gby = b) – user1157129 May 21 '12 at 03:05
  • @user1157129, copying your above code and inserting line breaks (and loading the data.table package) works for me using R 2.15.0 and data.table 1.8.0 - are you getting an error message? What isn't working? – BenBarnes May 21 '12 at 05:45
  • I get the following error, Ben: Error in names(byval) = as.character(bysuborig) : 'names' attribute [2] must be the same length as the vector [1] – user1157129 May 21 '12 at 06:06
  • @user1157129, I get this error using R 2.13.1 patched and data.table 1.7.10 - would you be able to upgrade to the latest versions of the two packages? – BenBarnes May 21 '12 at 06:29
  • Yes, the upgrades to R 2.15 and data.table 1.8 fixed this issue. Thanks Ben for your help! – user1157129 May 21 '12 at 08:06