8

I am looking for patterns for manipulating data.table objects whose structure resembles that of dataframes created with melt from the reshape2 package. I am dealing with data tables with millions of rows. Performance is critical.

The generalized form of the question is whether there is a way to perform grouping based on a subset of values in a column and have the result of the grouping operation create one or more new columns.

A specific form of the question could be how to use data.table to accomplish the equivalent of what dcast does in the following:

input <- data.table(
  id=c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3), 
  variable=c('x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'y', 'other'),
  value=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
dcast(input, 
  id ~ variable, sum, 
  subset=.(variable %in% c('x', 'y')))

the output of which is

  id  x  y
1  1  1  5
2  2  4 11
3  3 15  9
Sim
  • 13,147
  • 9
  • 66
  • 95
  • i'm confused, sorry -- the dcast() function works without the as.data.frame() coercion. are you trying to accomplish this without using the plyr package? – Anthony Damico Dec 20 '12 at 09:13
  • @AnthonyDamico yes, I'd like to accomplish this without `dcast` which is in `reshape2`. I have removed the as.data.frame()--thanks for pointing that out. – Sim Dec 20 '12 at 09:14

3 Answers3

9

Quick untested answer: seems like you're looking for by-without-by, a.k.a. grouping-by-i :

setkey(input,variable)
input[c("x","y"),sum(value)]

This is like a fast HAVING in SQL. j gets evaluated for each row of i. In other words, the above is the same result but much faster than :

input[,sum(value),keyby=variable][c("x","y")]

The latter subsets and evals for all the groups (wastefully) before selecting only the groups of interest. The former (by-without-by) goes straight to the subset of groups only.

The group results will be returned in long format, as always. But reshaping to wide afterwards on the (relatively small) aggregated data should be relatively instant. That's the thinking anyway.

The first setkey(input,variable) might bite if input has a lot of columns not of interest. If so, it might be worth subsetting the columns needed :

DT = setkey(input[ , c("variable","value")], variable)
DT[c("x","y"),sum(value)]

In future when secondary keys are implemented that would be easier :

set2key(input,variable)              # add a secondary key 
input[c("x","y"),sum(value),key=2]   # syntax speculative

To group by id as well :

setkey(input,variable)
input[c("x","y"),sum(value),by='variable,id']

and including id in the key might be worth setkey's cost depending on your data :

setkey(input,variable,id)
input[c("x","y"),sum(value),by='variable,id']

If you combine a by-without-by with by, as above, then the by-without-by then operates just like a subset; i.e., j is only run for each row of i when by is missing (hence the name by-without-by). So you need to include variable, again, in the by as shown above.

Alternatively, the following should group by id over the union of "x" and "y" instead (but the above is what you asked for in the question, iiuc) :

input[c("x","y"),sum(value),by=id]
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • Secondary keys would be very helpful. Alternatively, a way to build "variations" by reference using different keys. The pattern I keep discovering is needing to change the key several times for different operations with the same by clause but different select clauses and then manually assemble the final result from the datatables of each operation. – Sim Dec 20 '12 at 17:29
3
> setkey(input, "id")
> input[ , list(sum(value)), by=id]
   id V1
1:  1  6
2:  2 15
3:  3 34

> input[ variable %in% c("x", "y"), list(sum(value)), by=id]
   id V1
1:  1  6
2:  2 15
3:  3 24

The last one:

> input[ variable %in% c("x", "y"), list(sum(value)), by=list(id, variable)]
   id variable V1
1:  1        x  1
2:  1        y  5
3:  2        x  4
4:  2        y 11
5:  3        x 15
6:  3        y  9
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • This performs `sum` for all variables as opposed to just `x` and `y`. In my case there are hundreds of different variables so I need a subsetting clause. I'd love not to have to do this by creating an entire temporary datatable. – Sim Dec 20 '12 at 09:26
  • 1
    this is closer but still not perfect keycols <- c( "id" , "variable" ) ; setkeyv( input , keycols ) ; input[ input$variable %in% c( 'x' , 'y' ) , list( sum( value ) ) , by = keycols ] – Anthony Damico Dec 20 '12 at 09:29
  • @Sim: I was just showing you options. The last one is all you need and it doesn't create any intermediate table. – IRTFM Dec 20 '12 at 09:31
  • @Dwin not sure what you mean by "the last one". Do you think I can use `.BY` in the `j` expression to filter the variable values? – Sim Dec 20 '12 at 09:35
  • @AnthonyDamico is the index on `id` necessary since the table will need to be scanned anyway? – Sim Dec 20 '12 at 09:37
  • @DWin, I now see what you meant by the "last one". I awarded the answer to Matthew because of the explanation and because his solution runs 30% faster than yours in my tests. – Sim Dec 21 '12 at 00:10
2

I'm not sure if this is the best way, but you can try:

input[, list(x = sum(value[variable == "x"]), 
             y = sum(value[variable == "y"])), by = "id"]
#    id  x  y
# 1:  1  1  5
# 2:  2  4 11
# 3:  3 15  9
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Doesn't this create two temporary vectors for the `variable` equality for every value of `id`? – Sim Dec 20 '12 at 09:40