5

For the base R matrix class we have the rowsum function, which is very fast for computing column sums across groups of rows.

Is there an equivalent function or approach implemented in the Matrix-package?

I'm particularly interested in a fast alternative to rowsum for large dgCMatrix-objects (i.e. millions of rows, but roughly 95% sparse).

Konrad Rudolph
Misconstruction

3 Answers

6

I know this is an old question, but Matrix::rowSums might be the function you are looking for.

Rui Barradas
HowYaDoing
  • `Matrix::rowSums()` is a replacement for `base::rowSums()` (which computes the sum of every row, returning a vector), not `base::rowsum()` (which combines rows in specified groups, returning a matrix with a smaller number of rows) – Mike M Mar 06 '21 at 15:04
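
To make the distinction in the comment above concrete, here is a minimal base R sketch on a small toy matrix (the matrix and the group labels are made up for illustration):

```r
# Toy 3 x 2 matrix; the values and group labels are illustrative only
x <- matrix(1:6, nrow = 3,
            dimnames = list(c("a", "b", "c"), c("c1", "c2")))

# rowSums(): one total per row, returned as a named vector of length 3
row_totals <- rowSums(x)

# rowsum(): rows a and b are combined into group g1, row c becomes g2,
# so the result is a 2 x 2 matrix of per-group column sums
group_sums <- rowsum(x, group = c("g1", "g1", "g2"))

print(row_totals)
print(group_sums)
```

The first call collapses columns, the second collapses rows within groups, which is the operation the question asks about.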
1

The DelayedArray Bioconductor package now has a rowsum function that accepts sparse matrices. It was very fast when I tried it.

0

Here is an approach using matrix multiplication, based on an example in https://slowkow.com/notes/sparse-matrix/. First, let's create a sparse matrix to play with,

library(magrittr)
library(forcats)
library(stringr)
library(Matrix)

set.seed(42)
m <- sparseMatrix(
  i = sample(x = 1e4, size = 1e4),
  j = sample(x = 1e4, size = 1e4),
  x = rnorm(n = 1e4)
)
colnames(m) <- str_c("col", seq(ncol(m)))
rownames(m) <- str_c("row", seq(nrow(m)))

and a grouping vector defining which rows to sum,

group <- sample(1:10, nrow(m), replace = TRUE) %>%
  paste0("new_row", .) %>%
  fct_inorder

Whether group is a factor, and how its levels are ordered, determines the final row order in the merged matrix. I made group a factor with levels ordered by first appearance so that the row order matches the output of rowsum() with reorder = FALSE.

Next, we create a (sparse) indicator matrix with one row per group and one column per row of m. Left-multiplying m by this matrix gives a version of m whose rows have been summed based on group,

group_mat <- sparse.model.matrix(~ 0 + group) %>% t
# Adjust row names to get the correct final row names
rownames(group_mat) <- rownames(group_mat) %>% str_extract("(?<=^group).+")

msum <- group_mat %*% m  

The result matches base::rowsum() on the dense version of the matrix,

d <- as.matrix(m)
dsum <- rowsum(d, group, reorder = FALSE)
all.equal(as.matrix(msum), dsum)
#> [1] TRUE

but the sparse-matrix multiplication method is much faster,

bench::mark( msum <- group_mat %*% m )$median
#> [1] 344µs
bench::mark( dsum <- rowsum(d, group) )$median
#> [1] 146ms
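
The same idea can be wrapped in a small helper that needs only the Matrix package. This is a sketch under some assumptions, not a function from any package: the name `sparse_rowsum` is hypothetical, `group` is assumed to contain no `NA` values, and the indicator matrix is built directly with `sparseMatrix()` instead of `sparse.model.matrix()`.

```r
library(Matrix)

# Hypothetical helper (not part of the Matrix package): sums the rows of a
# sparse matrix `m` within the groups given by `group`.
# Assumes `group` has one entry per row of `m` and contains no NA values.
sparse_rowsum <- function(m, group) {
  stopifnot(length(group) == nrow(m), !anyNA(group))
  group <- as.character(group)
  # Levels in order of first appearance, mirroring rowsum(..., reorder = FALSE)
  f <- factor(group, levels = unique(group))
  # Indicator matrix: one row per group, one column per row of `m`
  ind <- sparseMatrix(
    i = as.integer(f),
    j = seq_along(f),
    x = 1,
    dims = c(nlevels(f), length(f)),
    dimnames = list(levels(f), NULL)
  )
  ind %*% m
}

# Small random example to compare against base::rowsum() on the dense matrix
set.seed(1)
m_small <- rsparsematrix(1000, 50, density = 0.05)
g_small <- sample(letters[1:5], nrow(m_small), replace = TRUE)
res <- sparse_rowsum(m_small, g_small)
ref <- rowsum(as.matrix(m_small), g_small, reorder = FALSE)
print(all.equal(unname(as.matrix(res)), unname(ref)))
```

The result stays sparse, so for matrices with millions of rows it avoids ever creating the dense version that base::rowsum() would require.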
Mike M