1

I have a matrix with a grouping column and a column to sum up values.

I want to split the matrix into groups, sum up some values (by keeping the same length of vectors), unsplit it and assign them to a new column?

What is the most performant and base-R-onic way of doing that?

The winner for now is an lapply function, but I am wondering if there is another function I am missing? Something like stats::aggregate which keeps the same structure?

I would like to stay in base R and keep matrices, so no dplyr or data.table ;).

Edit1: I included the aggregate + merge and the sapply solution powered by @IceCreamToucan. (Thanks for that). Its not a very fair comparison for aggregate as I first convert to data.frames and then back to matrices.

Edit2: With bigger matrices and 100 groups ave outperforms the other functions. Thanks to @Gregor for this one.

set.seed(104)
smpl = sample(1:100, size = 10000, T)
mat0 <- as.matrix(data.frame(
  group=smpl,
  sum=sample(seq(0,100,10), length(smpl), T)
))
mat1 <- cbind(mat0, "sums"=0)


library(microbenchmark)
check <- function(values) {
  all(sapply(values[-1], function(x) all.equal(values[[1]], x)))}
mf = microbenchmark(#check = check,
                    forloop = {
                      mat <- mat1
                      for (z in unique(mat[,'group'])) {
                        mat[mat[,'group'] == z,'sums'] = sum(mat[mat[,'group'] == z,'sum'])
                      }
                      mat
                    },
                    lapply = {
                      mat <- mat1
                      mat[,'sums'] <- unlist(lapply(unique(mat[,'group']), function(i) {
                        sums = sum(mat[mat[,'group'] == i,'sum'])
                        rep(sums, length(mat[mat[,'group'] == i,'sum']))
                      }))
                      mat
                    },
                    sapply = {
                      mat <- mat1
                      mat <- mat[order(mat[,'group']),]
                      mat[,'sums'] <- rep(sapply(split(mat[, 'sum'], mat[, 'group']), sum), 
                                          table(mat[, 'group']))
                      mat
                    },
                    ave = {
                      mat <- mat1
                      mat[,'sums'] <- ave(x = mat[, 'sum'], mat[, 'group'], FUN = sum)
                      mat[order(mat[,'group']),]
                    },
                    aggregate = {
                      matA <- mat0
                      matA <- matA[order(matA[,'group']),]
                      res = aggregate(sum ~ group, FUN = sum, data = matA)
                      matdf = data.frame(matA)
                      base::merge(res, matdf, by ="group")
                    }
)
mf
Unit: milliseconds
      expr      min       lq     mean   median       uq       max neval cld
   forloop 19.94083 25.73131 25.95823 25.97898 26.58043  38.68300   100  bc
    lapply 15.96057 21.44226 24.23693 21.88130 22.41287 311.00252   100  bc
    sapply 21.89081 22.41981 23.42291 22.70492 23.04978  37.41853   100  b 
       ave 11.79256 12.08868 12.51119 12.27613 12.52803  18.20577   100 a  
 aggregate 26.54753 27.31484 29.09592 27.71163 28.71937  54.75284   100   c
SeGa
  • 9,454
  • 3
  • 31
  • 70
  • If you were using data frames, you could `merge` `mat1` with `aggregate(sum ~ group, FUN = sum, data = mat1)`. – IceCreamToucan Jan 29 '19 at 20:11
  • That works with matrices too, but it drops the values. The original matrix has nrow 8, and the `aggregate` result has nrow 4. Or do you mean the merging needs data.frames? – SeGa Jan 29 '19 at 20:14
  • Yeah the `aggregate` part works, but not `merge`, which is how you would replicate the individual values. – IceCreamToucan Jan 29 '19 at 20:15
  • 1
    You could get the `sums` column as `rep(sapply(split(mat[, 'sum'], mat[, 'group']), sum), table(mat[, 'group']))`. I think this is more confusing than just using data frames, but if you're married to matrices that's how I'd do it. – IceCreamToucan Jan 29 '19 at 20:16
  • Thats nice, but unfortunately still slower than lapply. I'll include it in my answer and benchmark if you dont mind? – SeGa Jan 29 '19 at 20:20
  • 1
    Interesting. You can incude it, sure. – IceCreamToucan Jan 29 '19 at 20:21
  • 1
    `ave` is faster than for loop but still about 4-5x slower than `lapply`. `ave(x = mat1[, 'sum'], mat1[, 'group'], FUN = sum)` – Gregor Thomas Jan 29 '19 at 20:27
  • 2
    ...though with the timings being in microseconds I wonder if the example is large enough to be good. If you're really worried about performance I recommend doing an example with, say, at least 100 groups with a few rows each. – Gregor Thomas Jan 29 '19 at 20:30
  • I increased the matrix size, and it seems for big matrices, 'ave' is the new winner. – SeGa Jan 29 '19 at 21:03
  • 1
    And `ave` is just `split` and `lapply` internally ;) – Gregor Thomas Jan 29 '19 at 21:10
  • At this point, rather than editing solutins into your question, I would recommend you write up an answer. – Gregor Thomas Jan 29 '19 at 21:28
  • @Gregor: Since you came up with the fastest option, I would like to credit your answer, fi you want to write one. – SeGa Jan 29 '19 at 22:02

1 Answers1

1

Consulting various R-FAQ (how to sum by group?, Grouping functions and the *apply family), the base R function for the purpose of summing by group without aggregation is ave:

ave(x = mat1[, 'sum'], mat1[, 'group'], FUN = sum)

As edited into the question, ave is quite fast when there are lots of groups.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294