Efficient subsettting and column summation for matrix in R

Question

I am sorry if this is a silly question. I am looking to optimize my code, however, I am a newbie in R, so I do not know where to start.

I have a matrix X, whose rows are labeled by elements of y. Set of labels is numeric and consists of {1,...,K}. I want to be able to compute column sum for each submatrix corresponding to different labels and store it in M. To make this more clear, I am providing my current code:

for (i in 1:K) {
    cluster = (y == i)
    if (any(cluster)) {
      clusterRows = X[cluster, , drop = F]
      M[i, ] = colSums(clusterRows)
    }
}

Is there a better, more efficient way to do this? By efficient, I mean the running time.

EDIT: Example.

Input:

set.seed(1)
X = matrix(rnorm(100*2), nrow = 100, ncol = 2)
y = rep(1:2, 50)
M = matrix(rep(0,4), 2)
K = 2

Output:

       [,1]      [,2]
[1,] 9.776280 -2.595435
[2,] 1.112457 -1.185373

EDIT 2: I am not using any libraries besides base. Here is my sessionInfo():

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 19.3

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] microbenchmark_1.4-7 compiler_3.4.4       tools_3.4.4

Please share some sample data and expected output. For a case like this, where the goal is runtime efficiency, the best way to share sample data is to share code that simulates/creates appropriate sample data. (Please use `set.seed()` so any randomness is reproducible!) See many more tips for making reproducible examples [at this FAQ](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — Gregor Thomas, Sep 04 '20 at 22:23
You need to create the `M` matrix for this to work. `M <- matrix(rep(0,4), 2)`? — pseudospin, Sep 04 '20 at 23:00
I made an appropriate edit. I am sorry I forgot to include that. — statwanderer, Sep 04 '20 at 23:09

score 5 · Accepted Answer · answered Sep 04 '20 at 23:19

I think what you need is the rowsum() function from base R. This function will take each column of your matrix, and will collapse (or sum) each row of these columns, according to a group column, or a group vector. This a very fast function as you can see here: https://www.brodieg.com/2019/08/22/hydra-loose-ends/, so is problably what you want.

Here is the input:

set.seed(1)
X = matrix(rnorm(100*2), nrow = 100, ncol = 2)
y = rep(1:2, 50)
M = matrix(rep(0,4), 2)
K = 2

Here is the function:

rowsum(X, y)

Here is the result:

      [,1]      [,2]
1 9.776280 -2.595435
2 1.112457 -1.185373

score 2 · Answer 2 · answered Sep 04 '20 at 23:51

2

We could use aggregate from base R

aggregate(X ~ y, FUN = sum)

answered Sep 04 '20 at 23:51

akrun

874,273
37
540
662

Efficient subsettting and column summation for matrix in R

2 Answers2