7

Does anyone know a slick way to order the results coming out of a ddply summarise operation?

This is what I'm doing to get the output ordered by descending depth.

  ddims <- ddply(diamonds, .(color), summarise, depth = mean(depth), table = mean(table))
  ddims <- ddims[order(-ddims$depth),]

With output...

> ddims
  color    depth    table
7     J 61.88722 57.81239
6     I 61.84639 57.57728
5     H 61.83685 57.51781
4     G 61.75711 57.28863
1     D 61.69813 57.40459
3     F 61.69458 57.43354
2     E 61.66209 57.49120

Not too ugly, but I'm hoping for a way to do it nicely within ddply(). Anyone know how?

Hadley's ggplot2 book has this example for ddply and subset, but it doesn't actually sort the output; it just selects the two smallest diamonds per group.

ddply(diamonds, .(color), subset, order(carat) <= 2)
Matt Dowle
Tommy O'Dell
  • I'm not sure there's something you can do "on the fly" -- but just a random note, instead of `ddims[order(-ddims$depth),]`, you might try `ddims[order(ddims$depth, decreasing=TRUE),]`. This way you don't have to make a new 'negative' vector object. – Steve Lianoglou Apr 30 '11 at 18:16
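The two base R spellings mentioned in the comment above can be compared directly. This is a minimal sketch: the `ddims` data frame is rebuilt by hand from the output printed in the question, and the names `by_negation` and `by_decreasing` are made up for illustration.

```r
# Summary values copied from the output shown in the question
ddims <- data.frame(
  color = c("D", "E", "F", "G", "H", "I", "J"),
  depth = c(61.69813, 61.66209, 61.69458, 61.75711, 61.83685, 61.84639, 61.88722),
  table = c(57.40459, 57.49120, 57.43354, 57.28863, 57.51781, 57.57728, 57.81239)
)

# Two spellings of a descending sort: negate the key, or ask for decreasing order
by_negation   <- ddims[order(-ddims$depth), ]
by_decreasing <- ddims[order(ddims$depth, decreasing = TRUE), ]

# Compare the two results
identical(by_negation, by_decreasing)
```

Besides avoiding the temporary negated vector, `decreasing = TRUE` also works for character columns, where unary minus is not defined.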

4 Answers

8

I'll use this occasion to advertise a bit for data.table, which is faster to run and (in my perception) at least as elegant to write:

library(data.table)
ddims <- data.table(diamonds)
system.time(ddims <- ddims[, list(depth=mean(depth), table=mean(table)), by=color][order(-depth)])

   user  system elapsed 
  0.003   0.000   0.004 

By contrast, without ordering, your ddply code already takes 30 times longer:

  user  system elapsed 
 0.106   0.010   0.119

With all the respect I have for Hadley's excellent work, e.g. on ggplot2, and general awesomeness, I must confess that for me, data.table entirely replaced ddply -- for speed reasons.

crayola
  • Thanks mate. I was unaware of the `data.table` package. Looks mighty quick and that's quite readable too. I'll be playing with some big data sets in the near future, so thanks for that. I'm going to wait to see if anyone chimes in with a `ddply` specific answer. – Tommy O'Dell Apr 30 '11 at 08:28
4

Yes, to sort you can just nest the ddply in another ddply. Here's how you would use ddply to sort on one column, for example your table column:

ddimsSortedTable <- ddply(ddply(diamonds, .(color), 
  summarise, depth = mean(depth), table = mean(table)), .(table))

  color    depth    table
1     G 61.75711 57.28863
2     D 61.69813 57.40459
3     F 61.69458 57.43354
4     E 61.66209 57.49120
5     H 61.83685 57.51781
6     I 61.84639 57.57728
7     J 61.88722 57.81239
Ben
  • This sounds so illogical and doesn't look nice. Generally this means bad code. Is this really the way to go? – CousinCocaine Apr 11 '14 at 18:58
  • Why not add your own answer and show a better method? – Ben Apr 12 '14 at 03:34
  • I get your comment, and my post sounds more negative than I intended. I came here because this was also my question. I solved it by saving my data frame as `df` and then doing `df[order(df$column), ]`. So I first save it to a data frame and then order it. – CousinCocaine Apr 12 '14 at 20:13
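For readers who find the nested ddply hard to read, plyr also exports `arrange()` and `desc()`, which can wrap the summarise call. This is only a sketch of an alternative, not something proposed in the thread: it assumes plyr and ggplot2 are installed, and the sort still happens outside ddply itself.

```r
library(plyr)
data(diamonds, package = "ggplot2")

# Summarise per colour with ddply, then sort the result by descending depth
ddims <- arrange(
  ddply(diamonds, .(color), summarise,
        depth = mean(depth), table = mean(table)),
  desc(depth)
)

ddims
```

If dplyr is attached after plyr, its own `arrange()`, `desc()` and `summarise()` mask the plyr versions, so writing `plyr::arrange()` and friends explicitly may be safer in mixed sessions.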
3

If you are using dplyr, I would recommend taking advantage of the %.% operator, which leads to more intuitive code.

data(diamonds, package = 'ggplot2')
library(dplyr)
diamonds %.%
  group_by(color) %.%
  summarise(
    depth = mean(depth),
    table = mean(table)
  ) %.%
  arrange(desc(depth))
Ramnath
  • Why are most of the answers to R questions black magic? Please explain where the %.% operator is documented and/or what it does. It's not something you can easily find with Google. – reinierpost Mar 06 '15 at 15:36
  • `help("%.%", package = 'dplyr')` – Ramnath Mar 09 '15 at 17:10
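To spell out what the operator does: `%.%` was dplyr's early chaining operator, so `x %.% f(y)` is evaluated as `f(x, y)`. Later dplyr releases deprecated it in favour of `%>%` (from magrittr, re-exported by dplyr). A sketch of the same chain written with `%>%`, assuming a recent dplyr and ggplot2 are installed:

```r
library(dplyr)
data(diamonds, package = "ggplot2")

# Each step receives the result of the previous one as its first argument
ddims <- diamonds %>%
  group_by(color) %>%
  summarise(depth = mean(depth), table = mean(table)) %>%
  arrange(desc(depth))

ddims
```

Read top to bottom, this is: group by colour, average depth and table within each group, then sort the seven summary rows by descending depth.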
1

A bit late to the party, but things might be a bit different with dplyr. Borrowing crayola's solution for data.table:

library(microbenchmark)
library(data.table)
library(dplyr)
data(diamonds, package = "ggplot2")

dat1 <- microbenchmark(
  dtbl <- data.table(diamonds)[, list(depth = mean(depth), table = mean(table)), by = color][order(-depth)],
  dplyr_dtbl <- arrange(summarise(group_by(tbl_dt(diamonds), color), depth = mean(depth), table = mean(table)), -depth),
  dplyr_dtfr <- arrange(summarise(group_by(tbl_df(diamonds), color), depth = mean(depth), table = mean(table)), -depth),
  times = 20,
  unit = "ms"
)

The results show that dplyr with tbl_dt is a bit slower than the data.table approach. However, dplyr with data.frame is faster:

            expr       min        lq    median        uq       max neval
      data.table  9.606571 10.968881 11.958644 12.675205 14.334525    20
dplyr_data.table 13.553307 15.721261 17.494500 19.544840 79.771768    20
dplyr_data.frame  4.643799  5.148327  5.887468  6.537321  7.043286    20

Note: I have obviously changed the names so the microbenchmark results are more readable

Slak