24

I am trying to write some wrapper functions to reduce code duplication with data.table.

Here is an example using mtcars. First, set up some data:

library(data.table)
data(mtcars)
mtcars$car <- factor(gsub("(.*?) .*", "\\1", rownames(mtcars)), ordered=TRUE)
mtcars <- data.table(mtcars)

Now, here is what I would usually write to get a summary of counts by group. In this case I am grouping by car:

mtcars[, list(Total=length(mpg)), by="car"][order(car)]

      car Total
      AMC     1
 Cadillac     1
   Camaro     1
...
   Toyota     2
  Valiant     1
    Volvo     1

The complication is that, since the arguments i and j are evaluated in the frame of the data.table, one has to use eval(...) if you want to pass in variables:

This works:

group <- "car"
mtcars[, list(Total=length(mpg)), by=eval(group)]

But now I want to order the results by the same grouping variable. I can't get any variant of the following to give me correct results. Notice how I always get a single row of results, rather than the ordered set.

mtcars[, list(Total=length(mpg)), by=eval(group)][order(group)]
   car Total
 Mazda     2

I know why: it's because group is evaluated in the parent.frame, not the frame of the data.table.

How can I evaluate group in the context of the data.table?

More generally, how can I use this inside a function? I need the following function to give me all the results, not just the first row of data:

tableOrder <- function(x, group){
  x[, list(Total=length(mpg)), by=eval(group)][order(group)]
}

tableOrder(mtcars, "car")
Uwe
  • 41,420
  • 11
  • 90
  • 134
Andrie
  • 176,377
  • 47
  • 447
  • 496

3 Answers3

14

Gavin and Josh are right. This answer is only to add more background. The idea is that not only can you pass variable column names into a function like that, but expressions of column names, using quote().

group = quote(car)
mtcars[, list(Total=length(mpg)), by=group][order(group)]
      group Total
        AMC     1
   Cadillac     1
     ...
     Toyota     2
    Valiant     1
      Volvo     1

Although, admitedly more difficult to start with, it can be more flexible. That's the idea, anyway. Inside functions you need substitute(), like this :

tableOrder = function(x,.expr) {
    .expr = substitute(.expr)
    ans = x[,list(Total=length(mpg)),by=.expr]
    setkeyv(ans, head(names(ans),-1))    # see below re feature request #1780
    ans
}

tableOrder(mtcars, car)
      .expr Total
        AMC     1
   Cadillac     1
     Camaro     1
      ...
     Toyota     2
    Valiant     1
      Volvo     1

tableOrder(mtcars, substring(car,1,1))  # an expression, not just a column name
      .expr Total
 [1,]     A     1
 [2,]     C     3
 [3,]     D     3
 ...
 [8,]     P     2
 [9,]     T     2
[10,]     V     2

tableOrder(mtcars, list(cyl,gear%%2))   # by two expressions, so head(,-1) above
     cyl gear Total
[1,]   4    0     8
[2,]   4    1     3
[3,]   6    0     4
[4,]   6    1     3
[5,]   8    1    14

A new argument keyby was added in v1.8.0 (July 2012) making it simpler :

tableOrder = function(x,.expr) {
    .expr = substitute(.expr)
    x[,list(Total=length(mpg)),keyby=.expr]
}

Comments and feedback in the area of i,j and by variable expressions are most welcome. The other thing you can do is have a table where a column contains expressions and then look up which expression to put in i, j or by from that table.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • +1 Useful info Matthew, I was wondering what the docs really meant by an "expression" and tried to fiddle with `expression()` so the pointer to `quote()` is most helpful. – Gavin Simpson Mar 14 '12 at 18:38
  • @Gavin NP. Did you see FAQ 1.6 and 1.7? I just remembered they cover exactly this point. – Matt Dowle Mar 14 '12 at 18:50
  • @Gavin Any tips how to improve docs much appreciated. I'm too close to the trees to see the wood. – Matt Dowle Mar 14 '12 at 18:54
  • @GavinSimpson -- Very good point about the ambiguity of the term "expression" in an R context. `quote(car)` is really a **call** that evaluates to an object of class **name**. `group` is then an object whose *value* is a **name** object. Objects of class **expression** don't enter in at all. I'm not sure what the best words for explaining that would be, though. Saying that "by can also take the name of an object whose value is a name" is precise but probably confusing as heck for most. Keeping the examples prominent, as Matthew does, is probably the best approach. – Josh O'Brien Mar 14 '12 at 19:31
  • +1 Thank you. FWIW, I studied FAQ 1.6 and 1.7 at length, but still couldn't find a solution to this prior to me posting this question. These FAQs were very helpful to point me to use `eval(...)` as the `j` argument, but I don't recall seeing a reference to `substitute()` in the same. Anyway, many thanks. Your answer is exactly what I needed. – Andrie Mar 14 '12 at 20:56
  • @MatthewDowle No, I hadn't seen those FAQ entries - I was mainly looking at `?[.data.table` to see what was special in the method. There I saw "expression" and my mind jumped to `expression()`. I'd already got `get()` to work so didn't follow up and your Answer was sufficient for me to get what was going on. It'll be tricky to get clear prose without hitting problems @Josh mentioned - `?quote` mentions "expression" explicitly so `?[.data.table` is not using the wrong terminology - but the FAQ you point to is pretty darned clear to me (and useful). – Gavin Simpson Mar 14 '12 at 21:04
  • @Andrie Great. Yes, `substitute()` is mentioned in FAQ 1.6, as an _idiom_ no less, sort of, but I guess it could be clearer. – Matt Dowle Mar 14 '12 at 22:12
  • @Josh Hm, good point. Wherever I've written _expression_ what I meant was anything that can be `eval`'d within the frame using column names as variables. Maybe that's anything that is `is.language` which appears to include `name`, `call` and `expression`! As Gavin says, `?quote` seems to use the word expression (not in a code font but in English), so that's how it's meant. I've never used `expression()` I don't think, if that helps to know! The same applies to all 3 arguments by the way: `i`,`j` and `by` can all be passed things that can be `eval`'d ;) – Matt Dowle Mar 14 '12 at 22:30
  • @Josh Ah here we go: `?eval` makes it clear. The first argument `expr` is an "object that can be evaluated" including "`call` or `expression` or `name`". So it seems _expression_ (English) != `expression` (code), but rather _expression_ == `expr` (argument name) == `is.language`. – Matt Dowle Mar 14 '12 at 22:56
11

Use get(group) to refer to the object named in group:

> mtcars[, list(Total=length(mpg)), by=eval(group)][order(get(group))]
        car Total
        AMC     1
   Cadillac     1
     Camaro     1
   Chrysler     1
     Datsun     1
      Dodge     1
     Duster     1
    Ferrari     1
       Fiat     2
       Ford     1
      Honda     1
     Hornet     2
    Lincoln     1
      Lotus     1
   Maserati     1
      Mazda     2
       Merc     7
    Pontiac     1
    Porsche     1
     Toyota     2
    Valiant     1
      Volvo     1
cn      car Total
> # vs
> mtcars[, list(Total=length(mpg)), by=eval(group)][order(group)]
       car Total
[1,] Mazda     2

The reason order(get(group)) works is that the expression is evaluated in the frame of the data.table. There, get(group) will look for an find variable car. If you evaluate that in the global environment it does't exist

> get(group)
Error in get(group) : object 'car' not found

but it does in the frame where evaluation takes place. group doesn't exist there but following usual rules it searches back up the parent frames until it finds something that matches group, which is the global env in this case. As such you need to be careful about the name of the object you use as group in your real function - you don't want to use something that might match in the data.table object for example. Using something like .group as the function arg would be pretty safe I guess.

Here is your function, modified:

tableOrder <- function(x, .group){
  x[, list(Total=length(mpg)), by=eval(.group)][order(get(.group))]
}

> tableOrder(mtcars, "car")
        car Total
        AMC     1
   Cadillac     1
     Camaro     1
   Chrysler     1
     Datsun     1
....
Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
10

For the general question of how to control scoping within data.table, Gavin's answer has got you well covered.

To really take full advantage of the data.table package's strengths, though, you should be setting the key for your data.table objects. A key causes your data to be presorted so that rows from the same level (or combinations of levels) of the grouping factor(s) are stored in contiguous blocks of memory. This can in turn greatly speed up grouping operations compared to 'ad hoc by's of the sort used in your example. (Search for 'ad hoc' in the datatable-faq (warning, pdf) for more details).

In many situations (your example included) using keys also has the happy side-effect of simplifying the code needed to manipulate a data.table. Plus, it automatically outputs the results in the order specified by the key, which is often what you want as well.

First, if you will only be needing to subset by the 'car' column, you could simply do:

## Create data.table with a key
group <- "car"
mtcars <- data.table(mtcars, key = group)

## Outputs results in correct order
mtcars[, list(Total=length(mpg)), by = key(mtcars)]
        car Total
        AMC     1
   Cadillac     1
     Camaro     1
   Chrysler     1
     Datsun     1

Even if your key contains several columns, using a key still makes for simpler code (and you gain the speed-up that's likely your real reason for using data.table in the first place!):

group <- "car"
mtcars <- data.table(mtcars, key = c("car", "gear"))
mtcars[, list(Total=length(mpg)), by = eval(group)]

EDIT: A picky note of caution

If the by argument is used to perform grouping based on a column that is part of the key but that is not the first element of the key the order of the results may still need post processing. So, in the second example above, if key = c("gear", "car"), then "Dodge" sorts before "Datsun". In a situation like that, I might still prefer to reorder the key beforehand, rather than reorder the results after the fact. Perhaps Matthew Dowle will weigh in which of those two is preferred/faster.

Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
  • 1
    Weighing in as requested ... you're spot on. If the groups are large then getting them contiguous in RAM saves page fetches to L2. However, if there are millions of 2-row groups, then it doesn't make much difference. Even when secondary keys are built in, a by on the primary key will still be faster for that reason. Very similar to a *clustered index* in SQL; i.e., that's the way the rows are actually stored on disk, to save data block reads. It's confusing though isn't it because people might naturally assume it would be faster to sort the (much smaller number of) aggregated rows afterwards. – Matt Dowle Mar 14 '12 at 18:47
  • 1
    A keyed by doesn't need to sort first, but an ad hoc does. So (I think) setkey + keyedby is comparable to ad hoc by. Need to be careful to compare apples to apples. Anyway, if `j` uses a small subset of `DT`'s columns, then a `setkey` on `DT[,columnsneeded]` (only) might be worth it, to save reordering not needed columns. – Matt Dowle Mar 14 '12 at 19:04
  • 1
    @MatthewDowle -- Thanks. That's very enlightening. One nice aspect of using `setkey` is that you only need to sort the data once. But is it also significantly faster to use it to sort a data.table that it is to sort the same data.table once using `method="radix"`? I suspect you're alluding to some such speed advantage when you write, "this is also one reason why setkey() is quick", in FAQ 3.2, but I'm not sure I know where further details are available. – Josh O'Brien Mar 14 '12 at 19:04
  • 1
    Looks like our last comments crossed paths, but that we were thinking about the same issue. You put it more succinctly, though ;) – Josh O'Brien Mar 14 '12 at 19:06
  • 1
    The sorting that setkey does is the same function called by ad hoc by: `data.table:::fastorder`. FAQ 3.2 just means that sorting is fast. In v1.8.0 (stable on R-Forge and soon on CRAN), sorting is fast for `character` too, and `character` is allowed in keys. At last! Internally it uses a counting sort for `character` and can even be faster than "radix" is some tests (and almost as fast in others), but for `character`. Factors are no longer mandated, but still supported. – Matt Dowle Mar 14 '12 at 19:18