
Can someone help me understand the difference between aggregate and ddply, using the following example:

A data frame:

mydat <- data.frame(first = rpois(10,10), second = rpois(10,10), 
                    third = rpois(10,10), group = c(rep("a",5),rep("b",5)))

Use aggregate to apply a function to a part of the data frame split by a factor:

aggregate(mydat[,1:3], by=list(mydat$group), mean)
  Group.1 first second third
1       a   8.8    8.8  10.2
2       b   6.8    9.4  13.4

Try to use aggregate for another function (returns an error message):

aggregate(mydat[,1:3], by=list(mydat$group), function(u) cor(u$first,u$second))
Error in u$second : $ operator is invalid for atomic vectors

Now, try the same with ddply (plyr package):

ddply(mydat, .(group), function(u) cor(u$first,u$second))
  group         V1
1     a -0.5083042
2     b -0.6329968

All tips, links, criticism are highly appreciated.

IRTFM
skip
  • I think you've demonstrated the difference. Or is there a question here? – Matthew Lundberg Jan 05 '13 at 21:57
  • Well, while I see there is a difference, I don't understand why it is so. What within these functions causes the difference I show? – skip Jan 05 '13 at 21:59
  • Part 5 of http://programming-r-pro-bro.blogspot.com/2012/12/r-faqs-for-fresh-starters.html has an awesome explanation with sample code. Basically, ddply will allow you to use DIFFERENT functions on each variable, whereas aggregate forces you to use the same function on all columns you pass. – d_a_c321 Oct 18 '13 at 17:01

3 Answers


aggregate calls FUN on each column independently (once per column within each group, on an atomic vector), which is why you get independent means. ddply passes all columns of each group to the function as a data frame. A quick demonstration of what is being passed in each case may be in order:

Some sample data for demonstration:

d <- data.frame(a=1:4, b=5:8, c=c(1,1,2,2))

> d
  a b c
1 1 5 1
2 2 6 1
3 3 7 2
4 4 8 2

By passing print as the function and ignoring the result returned by aggregate or ddply, we can see what gets passed to the function on each call.

aggregate:

tmp <- aggregate(d[1:2], by=list(d$c), print)
[1] 1 2
[1] 3 4
[1] 5 6
[1] 7 8

Note that individual columns are sent to print.

ddply:

tmp <- ddply(d, .(c), print)
  a b c
1 1 5 1
2 2 6 1
  a b c
3 3 7 2
4 4 8 2

Note that data frames are being sent to print.
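The same comparison can be made without reading printed output, by recording the class of whatever FUN receives. This is only a sketch: it reuses d from above, and split() with sapply() stands in for ddply so that nothing beyond base R is needed (the names agg_classes and split_classes are made up for illustration).

```r
d <- data.frame(a = 1:4, b = 5:8, c = c(1, 1, 2, 2))

# aggregate: collect the class of every object handed to FUN.
agg_classes <- character(0)
invisible(aggregate(d[1:2], by = list(d$c), function(x) {
  agg_classes <<- c(agg_classes, class(x))
  length(x)  # return a scalar so aggregate can assemble its result
}))

# split + sapply (a base R stand-in for ddply): each piece is a whole data frame.
split_classes <- sapply(split(d, d$c), function(piece) class(piece))

print(agg_classes)    # one entry per column per group, each an atomic class
print(split_classes)  # one entry per group
```

Because FUN only ever sees one column at a time under aggregate, a function that needs two columns at once cannot work there.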

Matthew Lundberg
  • Cool, thanks: the first two sentences contain exactly what I was looking for. – skip Jan 05 '13 at 22:05
  • @Arun What I pasted is the output from `print` within `aggregate`. I omitted the result from `aggregate` which is not important for this example. I'll edit the answer to better indicate this. – Matthew Lundberg Jan 06 '13 at 16:40

You've already been told why aggregate was the wrong {base} function to use for a function that requires two vectors as arguments, but you haven't yet been told which non-ddply approach would have succeeded.

The by( ... grp, FUN) method:

> cbind (by( mydat, mydat["group"], function(d) cor(d$first, d$second)) )
        [,1]
a  0.6529822
b -0.1964186

The sapply(split( ..., grp), fn) method:

> sapply(  split( mydat, mydat["group"]), function(d) cor(d$first, d$second)) 
         a          b 
 0.6529822 -0.1964186 
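If a data frame shaped like ddply's result is wanted, the split approach can be wrapped up roughly as follows. This is a sketch under assumptions: the seed is arbitrary and only there to make the run repeatable, so the correlations will not match the numbers shown above, and the column name V1 simply mimics ddply's default.

```r
set.seed(42)  # assumption: arbitrary seed, only for repeatability
mydat <- data.frame(first = rpois(10, 10), second = rpois(10, 10),
                    third = rpois(10, 10), group = rep(c("a", "b"), each = 5))

# Split into one data frame per group, then compute one correlation per piece.
pieces <- split(mydat, mydat$group)
cors <- vapply(pieces, function(d) cor(d$first, d$second), numeric(1))

# Assemble a two-column result similar in shape to ddply's output.
result <- data.frame(group = names(cors), V1 = unname(cors))
print(result)
```

vapply is used instead of sapply here only so that a non-scalar return value fails loudly rather than silently changing the shape of the result.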
IRTFM
  • I always feel guilty when I resort to `plyr/ddply`, because it means I failed, or was too impatient, to understand aggregate, by, or tapply. Note however that `ddply` is so general it can be slow for large problems. A fast version of ddply is being developed here: https://github.com/hadley/dplyr . – Dieter Menne Jan 06 '13 at 10:04

The answer by @MatthewLundberg is very good. Mine is not really an answer, just a general hint that I use when I want to see what happens behind an R function call: I use the debugging command browser.

aggregate(mydat[,1:3], by=list(mydat$group), 
+           function(x){
+             browser()
+             mean(x)
+           })
Called from: FUN(X[[1L]], ...)
Browse[1]> x
[1] 16 10 16 13 25

Then, for ddply:

ddply(mydat, .(group), function(u) {
+   browser()
+   cor(u$first,u$second)
+   })
Called from: .fun(piece, ...)
Browse[1]> u
  first second third group
1    16      8     9     a
2    10      6     6     a
3    16      6    10     a
4    13      8    10     a
5    25     10     4     a

Edit: debug the error yourself

Here I use the same technique to see why you get an error:

aggregate(mydat[,1:3], by=list(mydat$group), function(u) {
+   browser()
+   cor(u$first,u$second)
+   })
Called from: FUN(X[[1L]], ...)
Browse[1]> u
[1] 16 10 16 13 25    

As you can see, u here is an atomic vector (with no column names), so if you try

Browse[1]> u$first

you get an error:

Error in u$first : $ operator is invalid for atomic vectors
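The same failure can be reproduced outside the browser, which may be handy in a non-interactive script. A minimal sketch, assuming the vector below (copied from the browser output above) stands for what aggregate hands to the function; msg is just an illustrative name:

```r
# The values that aggregate passed as u in the session above.
u <- c(16, 10, 16, 13, 25)

# u is a plain atomic vector, not a data frame.
print(is.atomic(u))
print(is.data.frame(u))

# Capture the error message instead of stopping the script.
msg <- tryCatch(u$first, error = function(e) conditionMessage(e))
print(msg)
```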
agstudy