
I think I am using plyr incorrectly. Could someone please tell me if this is 'efficient' plyr code?

require(plyr)
plyr <- function(dd) ddply(dd, .(price), summarise, ss=sum(volume)) 

A little context: I have a few large aggregation problems and I have noted that they were each taking some time. In trying to solve the issues, I became interested in the performance of various aggregation procedures in R.

I tested a few aggregation methods - and found myself waiting around all day.

When I finally got results back, I discovered a huge gap between the plyr method and the others - which makes me think that I've done something dead wrong.

I ran the following code (I thought I'd check out the new dataframe package while I was at it):

require(plyr)
require(data.table)
require(dataframe)
require(rbenchmark)
require(xts)

plyr <- function(dd) ddply(dd, .(price), summarise, ss=sum(volume)) 
t.apply <- function(dd) unlist(tapply(dd$volume, dd$price, sum))
t.apply.x <- function(dd) unlist(tapply(dd[,2], dd[,1], sum))
l.apply <- function(dd) unlist(lapply(split(dd$volume, dd$price), sum))
l.apply.x <- function(dd) unlist(lapply(split(dd[,2], dd[,1]), sum))
b.y <- function(dd) unlist(by(dd$volume, dd$price, sum))
b.y.x <- function(dd) unlist(by(dd[,2], dd[,1], sum))
agg <- function(dd) aggregate(dd$volume, list(dd$price), sum)
agg.x <- function(dd) aggregate(dd[,2], list(dd[,1]), sum)
dtd <- function(dd) dd[, sum(volume), by=(price)]

obs <- c(5e1, 5e2, 5e3, 5e4, 5e5, 5e6, 5e6, 5e7, 5e8)
timS <- timeBasedSeq('20110101 083000/20120101 083000')

bmkRL <- list(NULL)

for (i in 1:5){
  tt <- timS[1:obs[i]]

  for (j in 1:8){
    pxl <- seq(0.9, 1.1, by= (1.1 - 0.9)/floor(obs[i]/(11-j)))
    px <- sample(pxl, length(tt), replace=TRUE)
    vol <- rnorm(length(tt), 1000, 100)

    d.df <- base::data.frame(time=tt, price=px, volume=vol)
    d.dfp <- dataframe::data.frame(time=tt, price=px, volume=vol)
    d.matrix <- as.matrix(d.df[,-1])
    d.dt <- data.table(d.df)

    listLabel <- paste('i=',i, 'j=',j)

    bmkRL[[listLabel]] <- benchmark(plyr(d.df), plyr(d.dfp), t.apply(d.df),     
                         t.apply(d.dfp), t.apply.x(d.matrix), 
                         l.apply(d.df), l.apply(d.dfp), l.apply.x(d.matrix),
                         b.y(d.df), b.y(d.dfp), b.y.x(d.matrix), agg(d.df),
                         agg(d.dfp), agg.x(d.matrix), dtd(d.dt),
          columns =c('test', 'elapsed', 'relative'),
          replications = 10,
          order = 'elapsed')
  }
}

The test was supposed to go up to 5e8, but it took too long, mostly due to plyr. The final table, for 5e5 observations, shows the problem:

$`i= 5 j= 8`
                  test  elapsed    relative
15           dtd(d.dt)    4.156    1.000000
6        l.apply(d.df)   15.687    3.774543
7       l.apply(d.dfp)   16.066    3.865736
8  l.apply.x(d.matrix)   16.659    4.008422
4       t.apply(d.dfp)   21.387    5.146054
3        t.apply(d.df)   21.488    5.170356
5  t.apply.x(d.matrix)   22.014    5.296920
13          agg(d.dfp)   32.254    7.760828
14     agg.x(d.matrix)   32.435    7.804379
12           agg(d.df)   32.593    7.842397
10          b.y(d.dfp)   98.006   23.581809
11     b.y.x(d.matrix)   98.134   23.612608
9            b.y(d.df)   98.337   23.661453
1           plyr(d.df) 9384.135 2257.972810
2          plyr(d.dfp) 9384.448 2258.048123

Is this right? Why is plyr 2250x slower than data.table? And why didn't using the new data frame package make a difference?

The session info is:

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xts_0.8-6        zoo_1.7-7        rbenchmark_0.3   dataframe_2.5    data.table_1.8.1     plyr_1.7.1      

loaded via a namespace (and not attached):
[1] grid_2.15.1    lattice_0.20-6 tools_2.15.1 
ricardo
  • For relatively simple data manipulation/aggregation problems, I have found data table to be extremely fast. If it can do it, I am not at all surprised it is the clear winner. I am not familiar enough with `plyr` to comment on it. – Joshua Jul 18 '12 at 02:30
  • Have you looked at the documentation for `plyr` and `data.table`? If I remember correctly, `plyr` works with base-`R` `data.frame`s. `data.table` uses a whole different representation, using keyed columns, and efficient radix sorting. It's much more database-like in this way. – Jason Morgan Jul 18 '12 at 02:33
  • i have had a look - but couldn't figure it out. plyr is more than just a bit slower... the apply family, agg, and by are very quick - and they are base. that's why i figured that i must be making some rookie error with plyr. – ricardo Jul 18 '12 at 02:44
  • plyr is very nice when you're doing a lot of different things. It's syntax makes coding things very easy. It's not necessarily the most efficient way to do things as you've seen but it is quite convenient for certain tasks. – Dason Jul 18 '12 at 02:47
  • I don't think you're doing anything wrong. The reason **plyr** is so popular is not because its fast, but because its syntax is far nicer than most other options. There are tons of people who never touch data larger than 10e5 or so, and so never really notice that its that slow. If you really have data on the order of 10e8 you'd be a fool not to use **data.table**. – joran Jul 18 '12 at 02:50
  • @joran -- Spot on. I've used **data.table** enough now that (much of) its syntax is starting to make "deep" sense to me, but when I **first** saw **plyr**, I immediately thought "*that's* what I've been looking for all these years". I'd be interested to see how much of **data.table**'s functionality could be wrapped in a more newbie friendly front end; `merge.data.table()` seems to be one existing example of this idea. – Josh O'Brien Jul 18 '12 at 03:00
  • Check out my answer to a similar question here: [(LINK)](http://stackoverflow.com/questions/10645815/why-are-lubridate-functions-so-slow-when-compared-with-as-posixct/10653798#10653798) – Tyler Rinker Jul 18 '12 at 03:29
  • See also http://www.numbertheory.nl/2011/10/28/comparison-of-ave-ddply-and-data-table/, http://www.mail-archive.com/r-help@r-project.org/msg142797.html, and http://groups.google.com/group/manipulatr/browse_thread/thread/5e8dfed85048df99 – Paul Hiemstra Aug 10 '12 at 08:52
  • Another worthwhile/relevant reference is http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega – Thell Aug 12 '12 at 09:45
  • FYI plyr is slow because it uses data frames which are slow :/ – hadley Sep 28 '13 at 20:35
  • @hadley, not sure what you mean here. Are you suggesting `data.frame`s (as a data structure) are slow? Why so? As Matthew writes [here](http://stackoverflow.com/questions/8991709/why-are-pandas-merges-in-python-faster-than-data-table-merges-in-r#comment11277792_8991709), what could be a better data structure? For ex: comparing `plyr v1.7.1` (on OP's post) and `plyr 1.8` on 1e6 rows with 1e4 unique grps - 34 sec vs 2.2! The diff. seems to come from the fun. `do.ply` in the line `piece <- pieces[[i]]` because of attributes - 86% of the time is spent in `attr` in 1.7.1! – Arun Dec 27 '13 at 00:39
  • @arun it's not the data structure that's the problem, it's the inefficient implementation of all the methods. For example, `[<-.data.frame` is not internal, shouldn't be such dramatic difference in https://gist.github.com/hadley/8150051, etc etc – hadley Dec 27 '13 at 19:07
  • @hadley, yes of course, but the time-consuming part in plyr (tested with v1.8) seems to be `rbind.fill` and `loop_apply`. Ex: `rbind.fill` has a nested for-loop which runs from 1:150K (on OP's example) on the outer loop and does assignment on each one of the 150K data.frames in the list from within R... Don't you think that's very inefficient as well? This double-for-loop takes 380 seconds out of the total 472 seconds. This is as opposed to 5.5 seconds using `base:::aggregate`. – Arun Dec 27 '13 at 22:23

1 Answer


Why is it so slow? A little research located a mailing-list posting from Aug. 2011 in which @hadley, the package author, states:

This is a drawback of the way that ddply always works with data frames. It will be a bit faster if you use summarise instead of data.frame (because data.frame is very slow), but I'm still thinking about how to overcome this fundamental limitation of the ddply approach.
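
To make the quote concrete, here is a small sketch of the two forms being contrasted: the summarise helper versus building a data.frame inside the per-group function. The toy data and sizes here are assumptions chosen only for illustration.

require(plyr)

set.seed(1)  # toy data in the same (price, volume) shape as the question's
dd <- data.frame(price  = sample(seq(0.9, 1.1, by = 0.01), 1e4, replace = TRUE),
                 volume = rnorm(1e4, 1000, 100))

# the summarise helper, as used in the question
res1 <- ddply(dd, .(price), summarise, ss = sum(volume))

# returning a data.frame explicitly from the per-group function; per the quote
# this is the slower variant, because data.frame() itself is slow
res2 <- ddply(dd, .(price), function(x) data.frame(ss = sum(x$volume)))

all.equal(res1, res2)  # same result either way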


As for whether it is efficient plyr code, I didn't know either. After a bunch of parameter testing and benchmarking, it looks like we can do better.

The summarise() in your command is just a helper function, pure and simple. We can replace it with our own sum function, since it doesn't help with anything that isn't already simple to write ourselves, and the .data and .(price) arguments can be made more explicit. The result is:

ddply( dd[, 2:3], ~price, function(x) sum( x$volume ) )

summarise() may seem nice, but it just isn't quicker than a simple function call; that makes sense once you compare our little function with the code for summarise. Running your benchmarks with the revised formula yields a noticeable gain. Don't take that to mean you've used plyr incorrectly; you haven't. It just isn't efficient, and nothing you can do with it will make it as fast as the other options.
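
As a quick sanity check, here is a sketch of how the original and revised calls can be compared with rbenchmark. The data generation mirrors the question's shape, but the size and number of unique prices are assumptions kept small so the comparison finishes quickly.

require(plyr)
require(rbenchmark)

set.seed(42)
obs <- 5e4                              # assumed size; far smaller than the real test
pxl <- seq(0.9, 1.1, by = 0.001)        # only ~200 unique prices, for speed
dd  <- data.frame(time   = seq_len(obs),
                  price  = sample(pxl, obs, replace = TRUE),
                  volume = rnorm(obs, 1000, 100))

plyr_Original  <- function(dd) ddply(dd, .(price), summarise, ss = sum(volume))
plyr_Optimized <- function(dd) ddply(dd[, 2:3], ~price, function(x) sum(x$volume))

benchmark(plyr_Original(dd), plyr_Optimized(dd),
          columns = c('test', 'elapsed', 'relative'),
          replications = 10, order = 'elapsed')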

In my opinion the optimized function still stinks: it isn't clear, it has to be mentally parsed, and it is still ridiculously slow compared with data.table (even with a 60% gain).


In the same thread mentioned above, a plyr2 project is mentioned in connection with plyr's slowness. Since the original answer to this question, the plyr author has released dplyr as the successor to plyr. While both plyr and dplyr are billed as data-manipulation tools, and your primary stated interest is aggregation, you may still be interested in benchmark results for the new package, since it has a reworked backend that improves performance.

plyr_Original   <- function(dd) ddply( dd, .(price), summarise, ss=sum(volume))
plyr_Optimized  <- function(dd) ddply( dd[, 2:3], ~price, function(x) sum( x$volume ) )

dplyr <- function(dd) dd %.% group_by(price) %.% summarize( sum(volume) )    

data_table <- function(dd) dd[, sum(volume), keyby=price]

The dataframe package has been removed from CRAN and has therefore been dropped from the tests, along with the matrix-based function versions.
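
For reference, here is a sketch of roughly how the run behind the table below can be reproduced. The data are generated with the same formula as the question's i=5, j=8 case (the time column is replaced with a simple index as a simplifying assumption), replications = 5 matches the "reps= 5" in the label, and the function definitions are the ones above plus the base-R ones from the question. The exact script is in the linked gist at the end.

require(data.table)
require(dplyr)
require(rbenchmark)

set.seed(1)
obs <- 5e5                                                        # i = 5
pxl <- seq(0.9, 1.1, by = (1.1 - 0.9) / floor(obs / (11 - 8)))    # j = 8
px  <- sample(pxl, obs, replace = TRUE)                           # ~158K unique prices
vol <- rnorm(obs, 1000, 100)

d.df <- data.frame(time = seq_len(obs), price = px, volume = vol)
d.dt <- data.table(d.df)

# note: the two plyr calls alone take minutes per replication at this size
benchmark(plyr_Original(d.df), plyr_Optimized(d.df),
          dplyr(d.df), dplyr(d.dt), data_table(d.dt),
          t.apply(d.df), l.apply(d.df), b.y(d.df), agg(d.df),
          columns = c('test', 'elapsed', 'relative'),
          replications = 5, order = 'elapsed')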

Here are the i=5, j=8 benchmark results:

$`obs= 500,000 unique prices= 158,286 reps= 5`
                  test elapsed relative
9     data_table(d.dt)   0.074    1.000
4          dplyr(d.dt)   0.133    1.797
3          dplyr(d.df)   1.832   24.757
6        l.apply(d.df)   5.049   68.230
5        t.apply(d.df)   8.078  109.162
8            agg(d.df)  11.822  159.757
7            b.y(d.df)  48.569  656.338
2 plyr_Optimized(d.df) 148.030 2000.405
1  plyr_Original(d.df) 401.890 5430.946

No doubt the optimization helped a bit. Take a look at the d.df functions; they just can't compete.

For a little perspective on the slowness of the data.frame structure, here are micro-benchmarks of the aggregation times of data_table and dplyr using a larger test dataset (i=8, j=8).
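
The call producing the numbers below is along these lines (a sketch: times = 10 matches the neval column, and d.df/d.dt are assumed to be built the same way as above, only with obs = 5e7).

require(microbenchmark)

# d.df and d.dt generated as in the sketch above, but with obs = 5e7 (i = 8, j = 8)
microbenchmark(data_table(d.dt),
               dplyr(d.dt),
               dplyr(d.df),
               times = 10)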

$`obs= 50,000,000 unique prices= 15,836,476 reps= 5`
Unit: seconds
             expr    min     lq median     uq    max neval
 data_table(d.dt)  1.190  1.193  1.198  1.460  1.574    10
      dplyr(d.dt)  2.346  2.434  2.542  2.942  9.856    10
      dplyr(d.df) 66.238 66.688 67.436 69.226 86.641    10

The data.frame is still left in the dust. Not only that, but here's the elapsed system.time to populate the data structures with the test data:

`d.df` (data.frame)  3.181 seconds.
`d.dt` (data.table)  0.418 seconds.
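
Those numbers come from timing the construction itself, along these lines (a sketch, reusing the tt, px, and vol vectors produced by the question's data generator):

system.time(d.df <- data.frame(time = tt, price = px, volume = vol))
system.time(d.dt <- data.table(time = tt, price = px, volume = vol))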

Both creation and aggregation of the data.frame are slower than for the data.table.

Working with a data.frame in R is slower than some alternatives, but as the benchmarks show, the built-in R functions blow plyr out of the water. Even managing the data.frame as dplyr does, which improves upon the built-ins, doesn't give optimal speed, whereas data.table is faster in both creation and aggregation, and it does what it does while working with/upon data.frames.
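
That last point is easy to check for yourself: a data.table is also a data.frame (it inherits the class), so data.frame-oriented code still works on it.

require(data.table)
dt <- data.table(price = 1, volume = 2)
class(dt)           # "data.table" "data.frame"
is.data.frame(dt)   # TRUE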

In the end...

plyr is slow because of the way it works with, and manages, data.frames.

[punt: see the comments on the original question].


## R version 3.0.2 (2013-09-25)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] microbenchmark_1.3-0 rbenchmark_1.0.0     xts_0.9-7           
## [4] zoo_1.7-11           data.table_1.9.2     dplyr_0.1.2         
## [7] plyr_1.8.1           knitr_1.5.22        
## 
## loaded via a namespace (and not attached):
## [1] assertthat_0.1  evaluate_0.5.2  formatR_0.10.4  grid_3.0.2     
## [5] lattice_0.20-27 Rcpp_0.11.0     reshape2_1.2.2  stringr_0.6.2  
## [9] tools_3.0.2

Data-Generating gist .rmd

Thell
  • +1. good suggestions. thanks mate. I'm re-running the tests with your suggested `plyr` and `dc` code today. i'll post an answer when they are done. i decided to drop the matrix bit, to speed things up a little (as moving the df into a matrix didn't seem to be adding much anyways). – ricardo Aug 18 '12 at 00:06
  • I accepted this answer, as it seems that's as far as we are going to get - unless Hadley wants to check in and explain the inner workings of `plyr`. – ricardo Aug 27 '12 at 20:39
  • @Thell Since you've mentioned ease of use I've added what `dtd()` actually is alongside, iiuc. How anyone can say that isn't easy beats me. But dplyr using the data.table back end is slower than using data.table directly, then? How come? – Matt Dowle Sep 28 '13 at 19:56
  • What's `replications` set to for these benchmark results? It may not be fair to `dplyr` since @Hadley says it has a small amount of overhead. `microbenchmark` may be better because it returns the minimum of a set of runs. The minimum of 3 runs on a larger dataset is my preference. – Matt Dowle Sep 28 '13 at 20:50
  • @MatthewDowle, I totally agree that data.table syntax can't get much simpler, yet I _think_ what Hadley is doing here is allowing the same func to be used across various backends. And, as to why it is slower than direct data.table looks to be the overhead you mention... relating to grouping. As for the replications, it was set at 10 if memory serves. – Thell Sep 28 '13 at 21:34
  • Three words of advice: Switch to dplyr. It combines an even easier (than plyr) interface with prudent use of data frames and blazing fast C++ "chassis." – c.gutierrez Apr 19 '14 at 17:15