
I have a huge sparse matrix of all zeros, and I would like to set some of its cells to 1 based on indices from another matrix. Note that different cells are replaced in each column, and their indices are provided. I tried this on sample data, and it's quite slow. My real data has 1E8 rows. I'd appreciate any suggestions.

library(Matrix)
library(microbenchmark)

microbenchmark(
    m1={
        n_row <- 8000
        n_col <- 5000

        # create a sparse matrix
        df <- Matrix(data=0, nrow=n_row, ncol=n_col, sparse=TRUE)

        # define indices to be replaced
        ind_replace <- data.frame(R1=c(4000, 5000), R2=c(1200, 3500), R3=c(7200, 7900))

        for (kk in 1:ncol(ind_replace)){
            df[ind_replace[1,kk]:ind_replace[2,kk], kk] <- 1
        }

    }
)

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
   m1 18.48567 19.84298 22.48396 20.05846 20.48897 139.8459   100
Andrew Gustar
TTT
  • (1) Why use a `data.frame` to store the range extremes in individual columns? That seems odd and not extensible, I would use a `matrix` (low-case "m") or a list, depending on how you come up with the ranges. (2) Your benchmark is *including* the creation of the `Matrix` and the `seq`uences, so you aren't measuring *just* the value replacement. (3) You're quibbling over milliseconds here, how big or complex is the real problem that you need to be optimizing to this degree? – r2evans Sep 16 '17 at 16:51
  • @r2evans, the real problem's matrix is on the order of 1e8*5e3, and for each column I need to set at least 80000 rows to 1, which is quite slow. I created this example as a demo. – TTT Sep 16 '17 at 16:59
  • @tao.hong Once you generate a sequence of indices you want to equate to 1, try this solution: https://stackoverflow.com/questions/44692603/create-a-sparse-matrix-given-the-indices-of-non-zero-elements-for-creation-of-d/44694865#44694865 – tushaR Sep 16 '17 at 18:03
  • Also, in the example you have provided you are trying to access column indices which are beyond the size of the matrix. `df[,7200]` does not exist. – tushaR Sep 16 '17 at 18:18
  • `l = lapply(ind_replace, function(x) x[1]:x[2]) ; n = lengths(l) ; sparseMatrix(i=unlist(l), j=rep(seq_len(ncol(ind_replace)), times=n), x=1, dims=c(n_row, n_col))` gives some speed-up – user20650 Sep 16 '17 at 19:11
  • @user20650, thanks for offering a solution. – TTT Sep 16 '17 at 21:46
  • @tao.hong ; you're welcome - probably gain more efficiency by tinkering with the format of your indices input (if possible) – user20650 Sep 16 '17 at 22:52
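
The `sparseMatrix()` one-liner suggested in the comments can be spelled out as follows. This is a minimal sketch on the question's sample data, not a benchmarked solution; the names `rows`, `cols`, and `spmat` are illustrative and do not appear in the original comment.

```r
library(Matrix)

n_row <- 8000
n_col <- 5000

# Each column of ind_replace holds the first and last row of one range;
# the k-th range goes into column k of the sparse matrix.
ind_replace <- data.frame(R1 = c(4000, 5000), R2 = c(1200, 3500), R3 = c(7200, 7900))

# Expand every range into its full sequence of row indices.
rows <- lapply(ind_replace, function(r) r[1]:r[2])

# Repeat each column number once per row index in its range.
cols <- rep(seq_along(rows), times = lengths(rows))

# Build the matrix in one call from (row, column, value) triplets
# instead of assigning into an existing all-zero matrix.
spmat <- sparseMatrix(i = unlist(rows, use.names = FALSE), j = cols,
                      x = 1, dims = c(n_row, n_col))
```

Because the matrix is constructed directly from the index triplets, no repeated `[<-` assignment on a sparse object is needed, which is the likely source of the speed-up the comment reports.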

1 Answer


Try the following. This version uses the start value of each range as the column index, so R3 = c(7200, 7900) is excluded from ind_replace: column 7200 does not exist in a matrix with 5000 columns.

library(Matrix)
n_row <- 8000
n_col <- 5000
ind_replace <- data.frame(R1 = c(4000, 5000), R2 = c(1200, 3500))
spmat <- Matrix(0, nrow = n_row, ncol = n_col, sparse = TRUE)

Create a two-column matrix `ind` holding the row and column indices of the non-zero elements, then assign through it in a single step.

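# For each column of ind_replace, expand the range into (row, column) pairs;
# the column index b is taken from the range's start value.
ind <- apply(ind_replace, MARGIN = 2, function(rng) data.frame(a = rng[1]:rng[2], b = rng[1]))
ind <- as.matrix(Reduce(rbind, ind))
spmat[ind] <- 1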
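
For comparison, here is a minimal sketch of the same matrix-index assignment where the column is taken from the position of each range in `ind_replace` (1, 2, 3), as the loop in the question does, rather than from the range's start value. Under that reading R3 can stay, since rows 7200:7900 fit within the 8000 rows. The helper variable `kk` mirrors the question's loop; this is an assumption about the intended behaviour, not part of the original answer.

```r
library(Matrix)

n_row <- 8000
n_col <- 5000
ind_replace <- data.frame(R1 = c(4000, 5000), R2 = c(1200, 3500), R3 = c(7200, 7900))
spmat <- Matrix(0, nrow = n_row, ncol = n_col, sparse = TRUE)

# Build a two-column (row, column) index matrix; the column index is the
# position kk of the range in ind_replace, as in the question's loop.
ind <- do.call(rbind, lapply(seq_along(ind_replace), function(kk) {
  cbind(ind_replace[1, kk]:ind_replace[2, kk], kk)
}))

# A single assignment through the index matrix sets all selected cells to 1.
spmat[ind] <- 1
```

This keeps one `[<-` call for all ranges instead of one per column, which is the part of the approach that avoids the per-iteration overhead of the loop.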
tushaR