
I have a piece of R code I want to optimise for speed when working with larger datasets. It currently depends on sapply cycling through a vector of numbers (which correspond to rows of a sparse matrix). The reproducible example below gets at the nub of the problem; it is the three-line function expensive() that chews up the time, and it's obvious why (lots of matching of big vectors against each other, and two nested paste statements on each cycle of the loop). Before I give up and start struggling with doing this bit of the work in C++, is there something I'm missing? Is there a way to vectorize the sapply call that will make it an order of magnitude or three faster?

library(microbenchmark)

# create an example object like a simple_triplet_matrix
# number of rows and columns in sparse matrix:
n <- 2000 # real number is about 300,000
ncols <- 1000 # real number is about 80,000

# number of non-zero values, about 10 per row:
nonzerovalues <- n * 10

stm <- data.frame(
  i = sample(1:n, nonzerovalues, replace = TRUE),
  j = sample(1:ncols, nonzerovalues, replace = TRUE),
  v = sample(rpois(nonzerovalues, 5), replace = TRUE)
)

# It seems to save about 3% of time to have i, j and v as objects in their own right
i <- stm$i
j <- stm$j
v <- stm$v

expensive <- function(){
  sapply(1:n, function(k){
    # microbenchmarking suggests it's quicker to use which() than a vector of TRUE and FALSE:
    whichi <- which(i == k)
    paste(paste(j[whichi], v[whichi], sep = ":"), collapse = " ")
  })
}

microbenchmark(expensive())

The output of expensive() is a character vector of n elements that looks like this:

 [1] "344:5 309:3 880:7 539:6 338:1 898:5 40:1"                                                                                
 [2] "307:3 945:2 949:1 130:4 779:5 173:4 974:7 566:8 337:5 630:6 567:5 750:5 426:5 672:3 248:6 300:7"                         
 [3] "407:5 649:8 507:5 629:5 37:3 601:5 992:3 377:8" 

For what it's worth, the motivation is to efficiently write data from a sparse matrix format - either from slam or Matrix, but starting with slam - into libsvm format (which is the format above, but with each row beginning with a number representing a target variable for a support vector machine - omitted in this example as it's not part of the speed problem). I'm trying to improve on the answers to this question. I forked one of the repositories referred to from there and adapted its approach to work with sparse matrices with these functions. The tests show that it works fine, but it doesn't scale up.
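
To make the final step concrete, here is a minimal sketch of the write I have in mind; the target vector y and the output path are hypothetical, just to show where the omitted target variable fits:

# Minimal sketch only: y stands in for a numeric target vector of length n
# (omitted from the speed problem above), and the path is just an example.
write_libsvm <- function(rows, y, path){
  # each output line is "<target> <j:v> <j:v> ...", one line per matrix row
  writeLines(paste(y, rows), con = path)
}

# e.g.
# write_libsvm(expensive(), y = sample(0:1, n, replace = TRUE), path = tempfile())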

Peter Ellis

1 Answer


Use package data.table. Its `by`, combined with its fast sorting, saves you from having to find the indices of equal i values yourself.

res1 <- expensive()


library(data.table)
cheaper <- function() {
  setDT(stm)
  # build a "j:v" string for every non-zero entry, then collapse the entries of
  # each row into a single string; keyby sorts the result by i
  res <- stm[, .(i, jv = paste(j, v, sep = ":"))
      ][, .(res = paste(jv, collapse = " ")), keyby = i][["res"]]

  setDF(stm) # clean-up, which might not be necessary
  res
}

res2 <- cheaper()

all.equal(res1, res2)
#[1] TRUE

microbenchmark(expensive(),
               cheaper())  
#Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval cld
# expensive() 127.63343 135.33921 152.98288 136.13957 138.87969 222.36417   100   b
#   cheaper()  15.31835  15.66584  16.16267  15.98363  16.33637  18.35359   100  a 
Roland
  • This is brilliant, thanks. Now the full operation, including writing to disk, of a 300,000 by 83,000 sparse matrix takes just two minutes. – Peter Ellis Jan 05 '17 at 07:35
  • Note that data.table is now parallelized internally. `fwrite` might also be of interest. – Roland Jan 05 '17 at 07:38
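
As a follow-up to the `fwrite` suggestion, here is a minimal sketch of writing the result straight to disk (the file name is just an example); passing a one-column list gives one row per line with no quoting:

# Sketch only: write the character vector produced by cheaper() to disk
fwrite(list(res2), file = "out.libsvm", quote = FALSE, col.names = FALSE)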