I'm trying to use the aggregate function with cbind, but I must be missing something.

I've seen in "Using Aggregate for Multiple Aggregations" that I can simply define which column I want to be fixed and which I'd like to add, but I just can't get the result I expected.

I have:

x <- data.frame(alfa = 1:9, beta = rep(1:3, 3))

  alfa beta
1    1    1
2    2    2
3    3    3
4    4    1
5    5    2
6    6    3
7    7    1
8    8    2
9    9    3

And I want to retrieve the mean of the entries aggregated by the ones in column beta. For that I've tried:

aggregate(cbind(alfa) ~ beta, data = x, FUN = function(x) c(gama = mean(x)) )

That gives me:

  beta alfa
1    1    4
2    2    5
3    3    6

Shouldn't the result be something like:

  alfa beta gama
1    1    1    4
2    2    2    5
3    3    3    6

How do I force the addition of column gama? Additionally, would someone clarify the basis of the cbind() function? I've been struggling to understand it. Regards!

Rubens

1 Answer

Aggregate takes all elements on the left side of the ~ and uses the given function on those values, while they are grouped by the values of the right side. Thus, your command

aggregate(alfa ~ beta, data=x, mean)

will return the mean values of alfa grouped by beta. (In SQL terms, this corresponds to SELECT beta, avg(alfa) FROM x GROUP BY beta.)
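To make the grouping concrete, here is a minimal sketch that compares the aggregate call with doing the same thing by hand via split and sapply. The names res and by_hand are mine, used only for illustration:

```r
x <- data.frame(alfa = 1:9, beta = rep(1:3, 3))

# Formula interface: mean of alfa within each value of beta
res <- aggregate(alfa ~ beta, data = x, FUN = mean)

# Roughly the same by hand: split alfa by beta, then take the mean of each piece
by_hand <- sapply(split(x$alfa, x$beta), mean)

res
by_hand
```

Both give one mean per value of beta; aggregate just returns them as a data frame with the grouping column attached.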

If you also want to output the first value of alfa in each group, that is a second aggregation, so your aggregation function has to return two values:

aggregate(alfa ~ beta, data=x, function(x) c(alfa=x[1], gamma=mean(x)))

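(Again in SQL, roughly: SELECT beta, min(alfa), avg(alfa) FROM x GROUP BY beta. Here min(alfa) stands in for the first value, which coincides with the minimum because alfa is increasing.)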
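One thing to watch with the two-value version: the result has only two columns, because aggregate stores both values in a single matrix column named alfa. A minimal sketch of flattening it into the layout from the question, using the common do.call(data.frame, ...) idiom; the names res, flat and out are mine, not from the original post:

```r
x <- data.frame(alfa = 1:9, beta = rep(1:3, 3))

res <- aggregate(alfa ~ beta, data = x,
                 FUN = function(v) c(alfa = v[1], gamma = mean(v)))

# res has two columns: beta, plus a matrix column alfa holding both results
dim(res)
is.matrix(res$alfa)

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, res)
names(flat)

# Rename and reorder to match the layout asked for in the question
out <- setNames(flat, c("beta", "alfa", "gama"))[, c("alfa", "beta", "gama")]
out
```

The flattened columns get prefixed names (alfa.alfa, alfa.gamma), which is why the sketch renames them at the end.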

You asked about cbind. As long as there is only one variable on the left-hand side, it makes no difference. But suppose you have the following situation:

x <- data.frame(alfa = 1:9, beta = rep(1:3, 3), gamma = rnorm(9))

and would like to compute, say, the mean of both columns alfa and gamma, you could do it like this:

aggregate(cbind(alfa, gamma) ~ beta, data=x, function(x) mean(x))

That way you just tell aggregate to apply the given function to both alfa and gamma.
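A small sketch of what that looks like in practice, checked against a column-by-column computation with tapply. The seed and the names res and by_hand are assumptions of mine, added only to make the example reproducible and readable:

```r
set.seed(42)  # assumption: fixed seed only so the numbers are reproducible
x <- data.frame(alfa = 1:9, beta = rep(1:3, 3), gamma = rnorm(9))

# One call, two aggregated columns
res <- aggregate(cbind(alfa, gamma) ~ beta, data = x, FUN = mean)

# The same means computed column by column
by_hand <- sapply(x[c("alfa", "gamma")],
                  function(col) tapply(col, x$beta, mean))

res
by_hand
```

The result has one column per variable named inside cbind, plus the grouping column beta.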

For more exhaustive examples, see ?aggregate.


Edit

You have to be careful not to mix up the different meanings of cbind. Used as a standalone function, it combines two vectors (or data.frames) of the same length into a matrix (or data.frame) with the inputs as separate columns:

> cbind(1:3, 7:9)
     [,1] [,2]
[1,]    1    7
[2,]    2    8
[3,]    3    9
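As a quick illustration of the matrix versus data.frame distinction, here is a sketch with named arguments; the names m and df are just placeholders of mine:

```r
# Two vectors give a matrix; argument names become column names
m <- cbind(a = 1:3, b = 7:9)
is.matrix(m)
colnames(m)

# If one of the inputs is a data.frame, the result is a data.frame
df <- cbind(data.frame(a = 1:3), b = 7:9)
is.data.frame(df)
names(df)
```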

Used in the formula notation of aggregate, cbind does something related but different. cbind(column1, column2) just tells aggregate to apply the given function to each of the two columns separately. Thus, something like

aggregate(cbind(alfa, gamma) ~ beta, data=x, function(x) mean(x[,1]*x[,2]))

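will not work. Rather, the function is called on each column separately: first with the values of alfa, then with the values of gamma (one group at a time). Inside the function x is therefore a plain vector, and x[,1] fails.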
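If you do need a per-group statistic that combines both columns, here is a sketch of two possible workarounds, together with a small probe showing what the function actually receives. The seed, the helper column alfa_gamma and the names probe, res and by_split are all assumptions of mine for illustration:

```r
set.seed(42)  # assumption: fixed seed only so the numbers are reproducible
x <- data.frame(alfa = 1:9, beta = rep(1:3, 3), gamma = rnorm(9))

# Probe: FUN receives a plain vector (no dim attribute) for each column and group
probe <- aggregate(cbind(alfa, gamma) ~ beta, data = x,
                   FUN = function(v) is.null(dim(v)))

# Workaround 1: build the product column first, then aggregate it
x$alfa_gamma <- x$alfa * x$gamma
res <- aggregate(alfa_gamma ~ beta, data = x, FUN = mean)

# Workaround 2: split the whole data frame so each group keeps both columns
by_split <- sapply(split(x, x$beta), function(d) mean(d$alfa * d$gamma))

probe
res
by_split
```

Both workarounds should give the same per-group means; which one reads better is mostly a matter of taste.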

Hope that clarifies your understanding.

Thilo
  • Thanks for your response, it works now. And now I understand the application of this cbind() function. Just confirming, cbind() converts the arguments into a unique c() structure, that could be indexed inside function(x), right? – Rubens Nov 29 '12 at 19:45