Equivalent of rowsum function for Matrix-class (dgCMatrix)

Question

For the base R matrix class we have the rowsum function, which is very fast for computing column sums across groups of rows.

Is there an equivalent function or approach implemented in the Matrix-package?

I'm particularly interested in a fast alternative to rowsum for large dgCMatrix-objects (i.e. millions of rows, but roughly 95% sparse).

Check the `slam` package: https://cran.r-project.org/web/packages/slam/slam.pdf – s.brunel, Jun 25 '18 at 14:37

score 6 · Answer 1 · edited Aug 30 '20 at 10:29

I know this is an old question, but Matrix::rowSums might be the function you are looking for.

edited Aug 30 '20 at 10:29 by Rui Barradas

answered Nov 26 '18 at 20:06 by HowYaDoing

`Matrix::rowSums()` is a replacement for `base::rowSums()` (which computes the sum of every row, returning a vector), not `base::rowsum()` (which combines rows in specified groups, returning a matrix with a smaller number of rows) – Mike M Mar 06 '21 at 15:04
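
To make the distinction in this comment concrete, here is a minimal sketch on a made-up 4 x 2 matrix (the names `d`, `g`, `sp` and the values are illustrative only, not from the question):

```r
library(Matrix)

# Toy 4 x 2 matrix; rows 1 and 3 are in group "a", rows 2 and 4 in group "b"
d <- matrix(1:8, nrow = 4)
g <- c("a", "b", "a", "b")

# rowSums(): one sum per row, so a vector of length nrow(d)
rs <- rowSums(d)

# rowsum(): column sums within each group, so a 2 x 2 matrix (one row per group)
gs <- rowsum(d, g)

# Matrix::rowSums() on the sparse version is still a per-row sum
sp <- Matrix(d, sparse = TRUE)
srs <- Matrix::rowSums(sp)
```

Here `rs` and `srs` have four elements each, while `gs` has one row per group, which is the behaviour the question asks for.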

score 1 · Answer 2 · answered Oct 15 '21 at 07:18

The DelayedArray Bioconductor package now has a rowsum function that accepts sparse matrices; it was very fast when I tried it.

answered Oct 15 '21 at 07:18 by Etienne Becht

score 0 · Answer 3 · answered Mar 09 '21 at 00:26

Here is an approach using matrix multiplication, based on an example in https://slowkow.com/notes/sparse-matrix/. First, let's create a sparse matrix to play with,

library(magrittr)
library(forcats)
library(stringr)
library(Matrix)

set.seed(42)
m <- sparseMatrix(
  i = sample(x = 1e4, size = 1e4),
  j = sample(x = 1e4, size = 1e4),
  x = rnorm(n = 1e4)
)
colnames(m) <- str_c("col", seq(ncol(m)))
rownames(m) <- str_c("row", seq(nrow(m)))

and a grouping vector defining which rows to sum,

group <- sample(1:10, nrow(m), replace = TRUE) %>%
  paste0("new_row", .) %>%
  fct_inorder

Whether group is a factor, and if so the order of its levels, affects the row order of the merged matrix. I made group a factor with levels ordered by first appearance so that the row order matches that of rowsum() with reorder = FALSE.
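
As a small illustration of the ordering point, the sketch below uses a made-up grouping vector and base R only; `factor(g, levels = unique(g))` is used here as a base R stand-in for `fct_inorder()` on a character vector, not the code from this answer:

```r
# Toy grouping vector and matrix (illustrative values only)
g <- c("b", "a", "b", "c", "a")
d <- matrix(1:10, nrow = 5)

# Default factor(): levels are sorted
lev_sorted <- levels(factor(g))                          # "a" "b" "c"

# Levels in order of first appearance
lev_inorder <- levels(factor(g, levels = unique(g)))     # "b" "a" "c"

# rowsum() follows the same two conventions via its reorder argument
rn_sorted  <- rownames(rowsum(d, g, reorder = TRUE))     # "a" "b" "c"
rn_inorder <- rownames(rowsum(d, g, reorder = FALSE))    # "b" "a" "c"
```

So a factor with first-appearance levels lines up with `reorder = FALSE`, and a default factor lines up with `reorder = TRUE`.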

Next, we create a (sparse) indicator matrix, with one row per group and one column per row of m, and left-multiply m by it to get a version of m whose rows have been summed based on group,

group_mat <- sparse.model.matrix(~ 0 + group) %>% t
# Adjust row names to get the correct final row names
rownames(group_mat) <- rownames(group_mat) %>% str_extract("(?<=^group).+")

msum <- group_mat %*% m
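
To see why the multiplication sums rows by group, it may help to run the same construction on a tiny made-up example. The names `g_toy`, `m_toy`, `ind` and `toy_sum` are illustrative; only Matrix is needed:

```r
library(Matrix)

# 4 rows in 2 groups: rows 1 and 3 are "a", rows 2 and 4 are "b"
g_toy <- factor(c("a", "b", "a", "b"))
m_toy <- sparseMatrix(
  i = 1:4,
  j = c(1, 2, 1, 3),
  x = c(10, 20, 30, 40),
  dims = c(4, 3)
)

# Same construction as group_mat, without the pipe:
# a 2 x 4 matrix of 0/1 entries, with a 1 wherever row k of m_toy belongs to the group
ind <- t(sparse.model.matrix(~ 0 + g_toy))
as.matrix(ind)

# Each output row is the sum of the input rows flagged with 1:
# group "a" = rows 1 + 3 = (40, 0, 0); group "b" = rows 2 + 4 = (0, 20, 40)
toy_sum <- ind %*% m_toy
```

Each row of the indicator matrix picks out the rows of one group, so the product adds exactly those rows together and stays sparse throughout.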

The result matches base::rowsum() on the dense version of the matrix,

d <- as.matrix(m)
dsum <- rowsum(d, group, reorder = FALSE)
all.equal(as.matrix(msum), dsum)
#> [1] TRUE

but the sparse-matrix multiplication method is much faster,

bench::mark( msum <- group_mat %*% m )$median
#> [1] 344µs
bench::mark( dsum <- rowsum(d, group) )$median
#> [1] 146ms
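
If you want this as a reusable function, one possible wrapper is sketched below. `sparse_rowsum()` is a hypothetical helper, not part of the Matrix package. It builds the indicator matrix directly with `sparseMatrix()` instead of `sparse.model.matrix()`, treats group values as character labels (so sorting is alphabetical, which can differ from `rowsum()` for numeric groups), and does not handle NA groups:

```r
library(Matrix)

# Hypothetical helper: grouped row sums for a sparse matrix via an indicator matrix
sparse_rowsum <- function(x, group, reorder = TRUE) {
  stopifnot(length(group) == nrow(x), !anyNA(group))
  group <- as.character(group)
  lev <- unique(group)              # groups in order of first appearance
  if (reorder) lev <- sort(lev)     # alphabetical order, like a default factor
  f <- factor(group, levels = lev)
  # One row per group, one column per row of x, 1 where the row is in the group
  ind <- sparseMatrix(
    i = as.integer(f),
    j = seq_along(f),
    x = 1,
    dims = c(length(lev), nrow(x)),
    dimnames = list(lev, NULL)
  )
  ind %*% x
}

# Usage on a toy matrix (illustrative values)
x <- sparseMatrix(i = 1:4, j = c(1, 2, 1, 3), x = c(10, 20, 30, 40), dims = c(4, 3))
res <- sparse_rowsum(x, c("b", "a", "b", "a"), reorder = FALSE)
rownames(res)  # "b" "a"
```

The result stays a sparse Matrix object; wrap it in `as.matrix()` only if you need a dense result.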
