Matrix Averaging of duplicate Rows

Question

I've got a cvs file with 2 columns (see below Matrix 1 for example). I would like to create a program to average the second column of the matrix for all the duplicate numbers in the first column. So for instance in the matrix below there are two rows of "2" in the first column. Those rows would be averaged into one column of ((356+456)/2 = 406) etc. So the final matrix would like the Matrix 2 at the bottom. Any ideas on how to do this?

Matrix 1

mat1 <- structure(c(1, 2, 2, 3, 4, 4, 4, 5, 234, 356, 456, 745, 568, 
            998, 876, 895), .Dim = c(8L, 2L))
mat1
     [,1] [,2]
[1,]    1  234
[2,]    2  356
[3,]    2  456
[4,]    3  745
[5,]    4  568
[6,]    4  998
[7,]    4  876
[8,]    5  895

Matrix 2

mat2 <- structure(c(1, 2, 3, 4, 5, 234, 406, 745, 814, 895), .Dim = c(5L, 2L))
mat2
     [,1] [,2]
[1,]    1  234
[2,]    2  406
[3,]    3  745
[4,]    4  814
[5,]    5  895

Is the first column always in numerical order? – Rich Scriven Nov 24 '14 at 19:45 — Rich Scriven, Nov 24 '14 at 19:45

score 2 · Answer 1 · answered Nov 24 '14 at 19:20

2

using just base R:

> x <- tapply(mat1[,2], mat1[,1], mean)
> matrix(c(as.integer(names(x)), x), ncol = 2)

answered Nov 24 '14 at 19:20

mmuurr

1,310
1
11
21

score 1 · Answer 2 · answered Nov 24 '14 at 19:17

1

The most elementary way would be to use tapply:

tapply(mat1[,2], mat1[,1], mean)

answered Nov 24 '14 at 19:17

Thilo

8,827
2
35
56

That won't return a matrix neither – David Arenburg Nov 24 '14 at 20:04
2

@DavidArenburg True. I'm not sure the OP is best of using a matrix - my impression was that this was just the first datatype that he knew could store what he needed. In my experience a matrix in many cases is not the best datatype to use. Thus, I gave an alternative on purpose. I was going to comment on that, but then my phone rang and I got distracted... – Thilo Nov 24 '14 at 20:46
@DavidArenburg, you're writing these comments as if `as.matrix` is an obscure and hard to apply function. – A5C1D2H2I1M1N2O1R2T1 Nov 25 '14 at 03:39
@AnandaMatho no. I'm writing these comments in order to guide people to try and match OPs desired output instead of just dumping code. `as.matrix`is easy to apply for you but a beginner in R possibly isn't familiar with this function, have you thought about that? – David Arenburg Nov 25 '14 at 04:26
@DavidArenburg, why should we *need* to do that? The question is simply about aggregation. By restricting ourselves to the confines of the OPs expectations, we're just providing free labor to this person, and providing solutions that might be useful only this person. If an answer is provided that proves to be more broadly applicable (and at the same time solves > 90% of the OPs problems), doesn't that seem like a more useful answer in SO? You don't go directly from a CSV to a matrix with no `dimnames` in R, so it's even questionable whether the OPs *reproducible* data is *representative* data. – A5C1D2H2I1M1N2O1R2T1 Nov 25 '14 at 07:53
@AnandaMahto if this question is about calculating mean per group in a *data.frame* it should have been closed imidiatly as it is a duplicate of at least a dozen of similar questions, such as [this](http://stackoverflow.com/questions/25198442/how-to-calculate-mean-median-per-group-in-a-dataframe-in-r), for example – David Arenburg Nov 25 '14 at 08:14

LeoRJorge · Accepted Answer · 2014-11-24T20:46:14.627

1

What about

as.matrix(aggregate(mat1[,2],by = list(mat1[,1]),FUN = mean))

edited Nov 24 '14 at 20:46

answered Nov 24 '14 at 19:17

LeoRJorge

474
1
5
13

That won't return a matrix – David Arenburg Nov 24 '14 at 20:03
@David `as.matrix(aggregate(mat1,by = list(mat1[,1]),FUN = mean))` – LeoRJorge Nov 24 '14 at 20:13
And if you don't want the column created by aggregate `as.matrix(aggregate(mat1,by = list(mat1[,1]),FUN = mean))[,-1]` – LeoRJorge Nov 24 '14 at 20:14
2

Why not just `as.matrix(aggregate(mat1[, 2] ~ mat1[, 1], FUN = mean))`? What do you need the additional variable for? – David Arenburg Nov 24 '14 at 20:18
Your suggestion is a different writing of almost the same. Instead of providing a formula, I used the `by` argument, you can also have the same thing with `as.matrix(aggregate(mat1[,2],by = list(mat1[,1]),FUN = mean))` – LeoRJorge Nov 24 '14 at 20:25
Added the suggestion by @david to the original answer – LeoRJorge Nov 24 '14 at 20:46

Rich Scriven · Answer 4 · 2014-11-25T03:45:00.917

If the first column is always in numerical order, you could try

cbind(unique(mat1[,1]), rowsum(mat1[,2], mat1[,1]) %/% matrix(table(mat1[,1])))
#      [,1] [,2]
# [1,]    1  234
# [2,]    2  406
# [3,]    3  745
# [4,]    4  814
# [5,]    5  895

rowsum is known to be more efficient than aggregate and tapply. However, there are obvious limitations. It would be nice if there was a rowmean function for grouped matrix calculations.

Another base R possibility is

s <- unname(split(mat1[,2], mat1[,1]))
cbind(unique(mat1[,1]), vapply(s, mean, 1))
#      [,1] [,2]
# [1,]    1  234
# [2,]    2  406
# [3,]    3  745
# [4,]    4  814
# [5,]    5  895

And a safer solution of these three would be to convert to data frame. Here I use dplyr for efficiency.

library(dplyr)
df <- group_by(as.data.frame(mat1), V1) %>% summarise(mean(V2))
as.matrix(unname(df))
#      [,1] [,2]
# [1,]    1  234
# [2,]    2  406
# [3,]    3  745
# [4,]    4  814
# [5,]    5  895

Thanks, converting to data frame also works very well – CPL Nov 25 '14 at 23:37 — CPL, Nov 25 '14 at 23:37

score 1 · Answer 5 · answered Nov 25 '14 at 04:15

1

The answer by @LeoRJorge is 98% of the way to the required output, just needs to be unnamed (if that's really required):

unname(as.matrix(aggregate(mat1[,2], list(mat1[,1]), mean)))

     [,1] [,2]
[1,]    1  234
[2,]    2  406
[3,]    3  745
[4,]    4  814
[5,]    5  895

answered Nov 25 '14 at 04:15

goangit

451
4
6

1

So, instead of posting a 2% answer, just suggest an edit to the answer that gets you 98% of the way.... – A5C1D2H2I1M1N2O1R2T1 Nov 25 '14 at 05:54
1

Unfortunately, I don't yet have enough reputation to post comments (except here on my own answers), otherwise I would have done that. Feel free to find some of my answers you think are helpful and vote for them if you like. Thanks. – goangit Nov 25 '14 at 07:58
I wonder in what context having names would be undesired. But if this is the case, unnaming should work without problems. – LeoRJorge Nov 25 '14 at 14:17

Matrix Averaging of duplicate Rows

5 Answers5