1

I am sorry if this is a silly question. I am looking to optimize my code, however, I am a newbie in R, so I do not know where to start.

I have a matrix X, whose rows are labeled by elements of y. Set of labels is numeric and consists of {1,...,K}. I want to be able to compute column sum for each submatrix corresponding to different labels and store it in M. To make this more clear, I am providing my current code:

for (i in 1:K) {
    cluster = (y == i)
    if (any(cluster)) {
      clusterRows = X[cluster, , drop = F]
      M[i, ] = colSums(clusterRows)
    }
}

Is there a better, more efficient way to do this? By efficient, I mean the running time.

EDIT: Example.

Input:

set.seed(1)
X = matrix(rnorm(100*2), nrow = 100, ncol = 2)
y = rep(1:2, 50)
M = matrix(rep(0,4), 2)
K = 2

Output:

       [,1]      [,2]
[1,] 9.776280 -2.595435
[2,] 1.112457 -1.185373

EDIT 2: I am not using any libraries besides base. Here is my sessionInfo():

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 19.3

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] microbenchmark_1.4-7 compiler_3.4.4       tools_3.4.4  
statwanderer
  • 113
  • 6
  • 3
    Please share some sample data and expected output. For a case like this, where the goal is runtime efficiency, the best way to share sample data is to share code that simulates/creates appropriate sample data. (Please use `set.seed()` so any randomness is reproducible!) See many more tips for making reproducible examples [at this FAQ](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Gregor Thomas Sep 04 '20 at 22:23
  • I have added the `input/output` + `sessionInfo()`. – statwanderer Sep 04 '20 at 22:56
  • 1
    You need to create the `M` matrix for this to work. `M <- matrix(rep(0,4), 2)`? – pseudospin Sep 04 '20 at 23:00
  • I made an appropriate edit. I am sorry I forgot to include that. – statwanderer Sep 04 '20 at 23:09

2 Answers2

5

I think what you need is the rowsum() function from base R. This function will take each column of your matrix, and will collapse (or sum) each row of these columns, according to a group column, or a group vector. This a very fast function as you can see here: https://www.brodieg.com/2019/08/22/hydra-loose-ends/, so is problably what you want.

Here is the input:

set.seed(1)
X = matrix(rnorm(100*2), nrow = 100, ncol = 2)
y = rep(1:2, 50)
M = matrix(rep(0,4), 2)
K = 2

Here is the function:

rowsum(X, y)

Here is the result:

      [,1]      [,2]
1 9.776280 -2.595435
2 1.112457 -1.185373

Pedro Faria
  • 707
  • 3
  • 7
2

We could use aggregate from base R

aggregate(X ~ y, FUN = sum)
akrun
  • 874,273
  • 37
  • 540
  • 662