3

I've got a cvs file with 2 columns (see below Matrix 1 for example). I would like to create a program to average the second column of the matrix for all the duplicate numbers in the first column. So for instance in the matrix below there are two rows of "2" in the first column. Those rows would be averaged into one column of ((356+456)/2 = 406) etc. So the final matrix would like the Matrix 2 at the bottom. Any ideas on how to do this?

Matrix 1

mat1 <- structure(c(1, 2, 2, 3, 4, 4, 4, 5, 234, 356, 456, 745, 568, 
            998, 876, 895), .Dim = c(8L, 2L))
mat1
     [,1] [,2]
[1,]    1  234
[2,]    2  356
[3,]    2  456
[4,]    3  745
[5,]    4  568
[6,]    4  998
[7,]    4  876
[8,]    5  895

Matrix 2

mat2 <- structure(c(1, 2, 3, 4, 5, 234, 406, 745, 814, 895), .Dim = c(5L, 2L))
mat2
     [,1] [,2]
[1,]    1  234
[2,]    2  406
[3,]    3  745
[4,]    4  814
[5,]    5  895
Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
CPL
  • 161
  • 2
  • 10

5 Answers5

2

using just base R:

> x <- tapply(mat1[,2], mat1[,1], mean)
> matrix(c(as.integer(names(x)), x), ncol = 2)
mmuurr
  • 1,310
  • 1
  • 11
  • 21
1

The most elementary way would be to use tapply:

tapply(mat1[,2], mat1[,1], mean)
Thilo
  • 8,827
  • 2
  • 35
  • 56
  • That won't return a matrix neither – David Arenburg Nov 24 '14 at 20:04
  • 2
    @DavidArenburg True. I'm not sure the OP is best of using a matrix - my impression was that this was just the first datatype that he knew could store what he needed. In my experience a matrix in many cases is not the best datatype to use. Thus, I gave an alternative on purpose. I was going to comment on that, but then my phone rang and I got distracted... – Thilo Nov 24 '14 at 20:46
  • @DavidArenburg, you're writing these comments as if `as.matrix` is an obscure and hard to apply function. – A5C1D2H2I1M1N2O1R2T1 Nov 25 '14 at 03:39
  • @AnandaMatho no. I'm writing these comments in order to guide people to try and match OPs desired output instead of just dumping code. `as.matrix`is easy to apply for you but a beginner in R possibly isn't familiar with this function, have you thought about that? – David Arenburg Nov 25 '14 at 04:26
  • @DavidArenburg, why should we *need* to do that? The question is simply about aggregation. By restricting ourselves to the confines of the OPs expectations, we're just providing free labor to this person, and providing solutions that might be useful only this person. If an answer is provided that proves to be more broadly applicable (and at the same time solves > 90% of the OPs problems), doesn't that seem like a more useful answer in SO? You don't go directly from a CSV to a matrix with no `dimnames` in R, so it's even questionable whether the OPs *reproducible* data is *representative* data. – A5C1D2H2I1M1N2O1R2T1 Nov 25 '14 at 07:53
  • @AnandaMahto if this question is about calculating mean per group in a *data.frame* it should have been closed imidiatly as it is a duplicate of at least a dozen of similar questions, such as [this](http://stackoverflow.com/questions/25198442/how-to-calculate-mean-median-per-group-in-a-dataframe-in-r), for example – David Arenburg Nov 25 '14 at 08:14
1

What about

as.matrix(aggregate(mat1[,2],by = list(mat1[,1]),FUN = mean))
LeoRJorge
  • 474
  • 1
  • 5
  • 13
1

If the first column is always in numerical order, you could try

cbind(unique(mat1[,1]), rowsum(mat1[,2], mat1[,1]) %/% matrix(table(mat1[,1])))
#      [,1] [,2]
# [1,]    1  234
# [2,]    2  406
# [3,]    3  745
# [4,]    4  814
# [5,]    5  895

rowsum is known to be more efficient than aggregate and tapply. However, there are obvious limitations. It would be nice if there was a rowmean function for grouped matrix calculations.

Another base R possibility is

s <- unname(split(mat1[,2], mat1[,1]))
cbind(unique(mat1[,1]), vapply(s, mean, 1))
#      [,1] [,2]
# [1,]    1  234
# [2,]    2  406
# [3,]    3  745
# [4,]    4  814
# [5,]    5  895

And a safer solution of these three would be to convert to data frame. Here I use dplyr for efficiency.

library(dplyr)
df <- group_by(as.data.frame(mat1), V1) %>% summarise(mean(V2))
as.matrix(unname(df))
#      [,1] [,2]
# [1,]    1  234
# [2,]    2  406
# [3,]    3  745
# [4,]    4  814
# [5,]    5  895
Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
1

The answer by @LeoRJorge is 98% of the way to the required output, just needs to be unnamed (if that's really required):

unname(as.matrix(aggregate(mat1[,2], list(mat1[,1]), mean)))

     [,1] [,2]
[1,]    1  234
[2,]    2  406
[3,]    3  745
[4,]    4  814
[5,]    5  895
goangit
  • 451
  • 4
  • 6
  • 1
    So, instead of posting a 2% answer, just suggest an edit to the answer that gets you 98% of the way.... – A5C1D2H2I1M1N2O1R2T1 Nov 25 '14 at 05:54
  • 1
    Unfortunately, I don't yet have enough reputation to post comments (except here on my own answers), otherwise I would have done that. Feel free to find some of my answers you think are helpful and vote for them if you like. Thanks. – goangit Nov 25 '14 at 07:58
  • I wonder in what context having names would be undesired. But if this is the case, unnaming should work without problems. – LeoRJorge Nov 25 '14 at 14:17