
I'm struggling to use data.table to summarize results of vector functions, something that's easy in ddply.

Issue 1: aggregate with an (expensive) function with vector output

dt <- data.table(x=1:20,y=rep(c("a","b"),each=10))

This ddply command produces what I want:

ddply(dt,~y,function(dtbit) quantile(dtbit$x))

This data table command does not do what I want:

dt[,quantile(x),by=list(y)]

I can hack at data.table like so:

dt[,list("0%"=quantile(x,0),"25%"=quantile(x,0.25),
    "50%"=quantile(x,0.5)),by=list(y)]

But that's verbose, and it would also be slow if the vector function "quantile" were slow, since it gets called once per output column.

A similar example is:

dt$z <- rep(sqrt(1:10),2)

ddply(dt,~y,function(dtbit) coef(lm(z~x,dtbit)))

Issue 2: Using a function with both vector input and output

xzsummary <- function(dtbit) t(summary(dtbit$x - dtbit$z))

ddply(dt,~y,xzsummary )

Can I do that kind of thing easily in data.table?

Apologies if these questions are already prominently answered.

This is a similar, not identical, issue to: data.table aggregations that return vectors, such as scale()

ewallace

1 Answer

> dt[ , as.list(quantile(x)),by=y]
   y 0%   25%  50%   75% 100%
1: a  1  3.25  5.5  7.75   10
2: b 11 13.25 15.5 17.75   20

I tried using rbind, but that failed to generate the by-y arrangement I was thinking you wanted. The trick with as.list (vs. list) is that it constructs a multi-element list when given a vector, with one element per value, whereas list only puts the whole vector into a single-element list.
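A minimal base R sketch of that difference, with no data.table involved (the names q, wrapped and spread are just for illustration):

```r
q <- quantile(1:10)      # named numeric vector with 5 values

wrapped <- list(q)       # a list with ONE element: the whole vector
spread  <- as.list(q)    # a list with FIVE elements, one per quantile

length(wrapped)          # 1
length(spread)           # 5
names(spread)            # "0%" "25%" "50%" "75%" "100%"
```

data.table turns each element of the list returned in j into a column, which is why the as.list version gives five named columns per group.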

as.list acts like sapply(x, list):

> dt[ , sapply(quantile(x), list), by=y]
   y 0%   25%  50%   75% 100%
1: a  1  3.25  5.5  7.75   10
2: b 11 13.25 15.5 17.75   20

Your target solution:

> ddply(dt,~y,function(dtbit) quantile(dtbit$x))
  y 0%   25%  50%   75% 100%
1 a  1  3.25  5.5  7.75   10
2 b 11 13.25 15.5 17.75   20
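The same as.list pattern should presumably carry over to the other two examples in the question. A sketch, assuming the dt and z column defined there; coefs and xz are names made up here, and the output layout is expected to match ddply's but is not shown:

```r
library(data.table)

# Data from the question
dt <- data.table(x = 1:20, y = rep(c("a", "b"), each = 10))
dt[, z := rep(sqrt(1:10), 2)]

# Issue 1, second example: one row of lm coefficients per group.
# lm() runs once per group; as.list() spreads the named coefficient
# vector into one column per coefficient.
coefs <- dt[, as.list(coef(lm(z ~ x))), by = y]

# Issue 2: a function of two columns that returns a named vector.
xz <- dt[, as.list(summary(x - z)), by = y]
```

Columns are referenced directly inside j, so no helper function taking a sub-table is needed; if an existing function expects a data frame, passing .SD to it is another option.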

I was kind of proud of that solution, but mindful of fortunes::fortune("Liaw-Baron principle"):

Lastly, by what we could call the 'Liaw-Baron principle', every question that can be asked has in fact already been asked. -- Dirk Eddelbuettel (citing Andy Liaw's and Jonathan Baron's opinion on unique questions on R-help) R-help (January 2006)

I did a search on: [r] data.table as.list, and found that I am by no means the first to post this strategy on SO:

Tabulate a data frame in R

Using ave() with function which returns a vector

create a formula in a data.table environment in R

I don't really know if this question would be considered a duplicate, but I am particularly grateful to @G.Grothendieck for the last one. It may be where I picked up the strategy. There were about 125 hits to that search and I've only gone through the first 20 to gather those examples, so there may be some more pearls that I haven't uncovered.

IRTFM