55

I have a data-frame likeso:

x <-
id1 id2    val1  val2 val3 val4
1   a   x    1    9
2   a   x    2    4
3   a   y    3    5
4   a   y    4    9
5   b   x    1    7
6   b   y    4    4
7   b   x    3    9
8   b   y    2    8

I wish to aggregate the above by id1 & id2. I want to be able to get the means for val1, val2, val3, val4 at the same time.

How do i do this?

This is what i currently have but it works just for 1 column:

agg <- aggregate(x$val1, list(id11 = x$id1, id2= x$id2), mean)
names(agg)[3] <- c("val1")  # Rename the column

Also, how do i rename the columns which are outputted as means in the same statement given above

Rookie
  • 5,179
  • 13
  • 41
  • 65

2 Answers2

88

We can use the formula method of aggregate. The variables on the 'rhs' of ~ are the grouping variables while the . represents all other variables in the 'df1' (from the example, we assume that we need the mean for all the columns except the grouping), specify the dataset and the function (mean).

aggregate(.~id1+id2, df1, mean)

Or we can use summarise_each from dplyr after grouping (group_by)

library(dplyr)
df1 %>%
    group_by(id1, id2) %>% 
    summarise_each(funs(mean))

Or using summarise with across (dplyr devel version - ‘0.8.99.9000’)

df1 %>% 
    group_by(id1, id2) %>%
    summarise(across(starts_with('val'), mean))

Or another option is data.table. We convert the 'data.frame' to 'data.table' (setDT(df1), grouped by 'id1' and 'id2', we loop through the subset of data.table (.SD) and get the mean.

library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)] 

data

df1 <- structure(list(id1 = c("a", "a", "a", "a", "b", "b", 
"b", "b"
), id2 = c("x", "x", "y", "y", "x", "y", "x", "y"), 
val1 = c(1L, 
2L, 3L, 4L, 1L, 4L, 3L, 2L), val2 = c(9L, 4L, 5L, 9L, 7L, 4L, 
9L, 8L)), .Names = c("id1", "id2", "val1", "val2"), 
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8"))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • thanks for the comment. Is df1 suppsed to be a function which needs ot be defined? It will be very helpful if you can explain the above syntax – Rookie Dec 30 '15 at 06:00
  • @Rookie `df1` is the data.frame object – akrun Dec 30 '15 at 06:01
  • What's the intuition about the formula in the aggregate version involving a sum, if it's a cartesian product really? Or perhaps it's that it treats the sum as the unique identifying value? So i.e. (a,x) and (a,y) are identified by a+x != a+y, rather than (a=a,x!=y). – mavavilj Jun 14 '21 at 07:16
35

You could try:

agg <- aggregate(list(x$val1, x$val2, x$val3, x$val4), by = list(x$id1, x$id2), mean)
Poorkenny
  • 1,246
  • 11
  • 16
  • please elaborate your answer-review – Ravi Teja Kumar Isetty Sep 08 '16 at 17:07
  • 2
    Nice going (+1). So if you provide a list instead of a vector as argument x (first one), `aggregate` will compute the aggregate for each value in the list. – Ricardo Jun 28 '17 at 13:31
  • 11
    Also, you can rename all columns returning from the `aggregate` call, by giving each list component a name, such as `agg <- aggregate(list(val1=x$val1, val2=x$val2, val3=x$val3, val4=x$val4), by = list(id1=x$id1, id2=x$id2), mean)`. This will replace the default column names assigned by `aggregate`. – Ricardo Jun 28 '17 at 13:34
  • 2
    Can I also change the function for each aggregation?? (example: sum val1 but take mean of val2 and val3) – theforestecologist Apr 29 '18 at 17:35
  • it will auto-sort the columns which is not desirable in every case – Deep Jan 12 '19 at 11:06