0

I want to group by on "col1", "col2" and get the mean of col3

newData
id    col1   col2    col3   col4   col5
1     200    2000    150     151    NA
2     200    2000    250     160   "tim"
3     201    2000    300     189    NA
4     201    2000    400     182    NA

I want my output to be

id    col1   col2    col3   col4    col5
1     200    2000    200     151     NA    
2     201    2000    350     189     NA

aggdata <-aggregate(newData, 
                by=list(newData$col1,newData$col2), 
                FUN=mean, na.rm=TRUE)

this gives me the mean of all variables which I do not want.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
user1946217
  • 1,733
  • 6
  • 31
  • 40
  • 2
    Can you explain what's going on with `col4` and `col5`? Also, why pick an example where `col2` is constant everywhere... Last, please give us the code to create `newData` (you can use `dput` for that) so as to make things easier for us. – flodel Sep 05 '13 at 10:39
  • 1
    Agree. Also, the do-something-by-group is well described [here](http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega) and more recently [here](http://stackoverflow.com/questions/18593482/subtract-pairs-of-columns-based-on-matching-column/18593572#18593572). The examples can easily be extended to more than one grouping variable. Please see help pages for the functions mentioned in the two posts. – Henrik Sep 05 '13 at 10:46

2 Answers2

1

Perhaps you are looking for the merge of two aggregates:

out <- merge(aggregate(col3 ~ col1 + col2, mydf, mean, na.rm = TRUE),
             aggregate(cbind(col4, col5) ~ col1 + col2, 
                       mydf, `[`, 1, na.action = na.pass),
             by = c("col1", "col2"))
out <- cbind(id = 1:nrow(out), out)
out
#   id col1 col2 col3 col4 col5
# 1  1  200 2000  200  151 <NA>
# 2  2  201 2000  350  189 <NA>

The first aggregate takes the mean of "col3". The second aggregate extracts the first element of "col4" and "col5" respectively.

I've manually created an "id" column because in your example, the "id" column in the output doesn't seem to fit any discernible pattern.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
1

It doesn't make much sense, but here's an alternative

> cbind(aggregate(col3~col1+col2, data=newData, FUN="mean"),
        newData[!duplicated(newData[, "col1"]), c("col4", "col5")])
  col1 col2 col3 col4 col5
1  200 2000  200  151 <NA>
3  201 2000  350  189 <NA>
Jilber Urbina
  • 58,147
  • 10
  • 114
  • 138