Make this loop faster in R

Question

How can I speed up the following (noob) code:

#"mymatrix" is the matrix of word counts (docs X terms) 
#"tfidfmatrix" is the transformed matrix
tfidfmatrix = Matrix(mymatrix, nrow=num_of_docs, ncol=num_of_words, sparse=T)

#Apply a transformation on each row of the matrix
for(i in 1:dim(mymatrix)[[1]]){
  r = mymatrix[i,]
  s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
  tfidfmatrix[i,] = s/sqrt(sum(s^2))
}
return (tfidfmatrix)

The problem is that the matrices I am working with are fairly large (~40k x 100k), and this code is very slow.

The reason I am not using "apply" (instead of a for loop and sapply) is that I want a num_of_docs X num_of_words matrix, but apply will give me its transpose. I would then have to spend more time computing the transpose and re-allocating it.
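The transpose behaviour described above can be seen on a toy example. This is only a minimal sketch on a made-up 2 x 3 matrix; the names m, by_row, normalized, and vectorized are illustrative and not part of the original code.

```r
# Toy 2 x 3 matrix standing in for a docs x terms matrix (made-up data)
m <- matrix(as.numeric(1:6), nrow = 2)

# apply() over rows binds each row's result as a COLUMN of the output,
# so a 2 x 3 input comes back as 3 x 2
by_row <- apply(m, 1, function(r) r / sqrt(sum(r^2)))
dim(by_row)

# Transposing restores the docs x terms orientation
normalized <- t(by_row)

# The same normalization without apply(): divide by the per-row norms
vectorized <- m / sqrt(rowSums(m^2))
all.equal(normalized, vectorized)
```

On a matrix of the size mentioned in the question, the extra t() call is itself a full copy, which is one reason to prefer the fully vectorized form.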

Any thoughts on making this faster?

Thanks much.

Edit: I have found that the suggestions below greatly speed up my code (besides making me feel stupid). Any suggestions on where I can learn to write "optimized" R code?

Edit 2: OK, so something is not right. Once I do s.vec[!is.finite(s.vec)] <- 0, every element of s.vec is set to 0. To reiterate, my original matrix is a sparse matrix containing integers, and I think this is due to some quirk of the Matrix package I am using. When I do s.vec[which(s.vec==-Inf)] <- 0 instead, things work as expected. Thoughts?
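One possible way to sidestep the sparse-matrix problem is sketched below, under the assumption that the counts are stored as a column-compressed sparse matrix (class dgCMatrix) from the Matrix package. Because the transformation maps 0 to 0, it only needs to touch the stored nonzero values in the x slot, so log(0) is never evaluated. The function name tfidf_sparse and the toy data are hypothetical, and this has not been tried on the asker's data.

```r
library(Matrix)

# Hypothetical helper: row-normalized transform of a sparse docs x terms matrix
tfidf_sparse <- function(counts, ndocs) {
  s <- drop0(counts)                  # discard any explicitly stored zeros
  x <- s@x                            # stored nonzero counts only
  s@x <- (1 + log(x)) * log((1 + ndocs) / (1 + x))
  norms <- sqrt(Matrix::rowSums(s^2)) # Euclidean norm of each row
  norms[norms == 0] <- 1              # leave all-zero rows unchanged
  Diagonal(x = 1 / norms) %*% s       # scale row i by 1 / norms[i]
}

# Made-up 4 x 5 count matrix; the last row is empty
counts <- sparseMatrix(i = c(1, 1, 2, 3), j = c(1, 3, 2, 5),
                       x = c(1, 3, 2, 4), dims = c(4, 5))
result <- tfidf_sparse(counts, ndocs = 10)
```

Multiplying by a diagonal matrix keeps the result sparse, whereas dividing a sparse matrix by a vector may produce a dense intermediate depending on the Matrix version.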

I don't know r, but have you tried moving `dim(mymatrix)` outside the loop? (can you?) — Luchian Grigore, Mar 05 '12 at 18:37
They probably could but it wouldn't make much of a difference. — Dason, Mar 05 '12 at 19:29
I believe I found this in the R FAQ some time ago. http://www.burns-stat.com/pages/Tutor/R_inferno.pdf. It is a brilliant and readable guide to vectorizing. — digitalmaps, Mar 06 '12 at 03:16

score 4 · Answer 1 · answered Mar 05 '12 at 19:10

As per my comment,

#Slightly larger example data
mymatrix <- matrix(runif(10000),nrow=10)
mymatrix[sample(10000,100)] <- 0
tfmat <- matrix(nrow=10, ncol=1000)
ndocs <- 1

justin <- function(){
    s.vec <- ifelse(mymatrix==0, 0, (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix)))
    tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}

joran <- function(){
    s.vec <- (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix))
    s.vec[!is.finite(s.vec)] <- 0
    tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))
}

library(rbenchmark)
benchmark(justin(),joran(),replications = 1000)

  test replications elapsed relative user.self sys.self user.child sys.child
2  joran()         1000   0.940  1.00000     0.842    0.105          0         0
1 justin()         1000   2.786  2.96383     2.617    0.187          0         0

So it's around 3x faster.
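The faster version relies on log(0) evaluating to -Inf in R, so zero entries become non-finite and can be reset afterwards. A minimal sketch, assuming made-up data of the same shape as in the benchmark (the names with_ifelse and with_subset are illustrative), checks that the two approaches agree:

```r
set.seed(1)  # made-up data, same shape as in the benchmark
mymatrix <- matrix(runif(10000), nrow = 10)
mymatrix[sample(10000, 100)] <- 0
ndocs <- 1

log(0)  # -Inf, so zero counts yield non-finite values below

# Approach 1: guard against zeros with ifelse()
with_ifelse <- ifelse(mymatrix == 0, 0,
                      (1 + log(mymatrix)) * log((1 + ndocs) / (1 + mymatrix)))

# Approach 2: compute everything, then overwrite the non-finite entries
with_subset <- (1 + log(mymatrix)) * log((1 + ndocs) / (1 + mymatrix))
with_subset[!is.finite(with_subset)] <- 0

all.equal(with_ifelse, with_subset)
```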

@joran : It never ceases to amaze me what I find on this site. Thanks much. — user721975, Mar 05 '12 at 19:56

score 3 · Answer 2 · answered Mar 05 '12 at 18:57

I'm not sure what ndocs is, but ifelse is already vectorized, so you should be able to apply it to the whole matrix at once instead of walking through the matrix row by row and calling sapply along each row. The same goes for the final calculation.

However, you haven't given a complete example to replicate...

mymatrix <- matrix(runif(100),nrow=10)
tfmat <- matrix(nrow=10, ncol=10)
ndocs <- 1

s.vec <- ifelse(mymatrix==0, 0, (1 + log(mymatrix)) * log((1 + ndocs)/(1 + mymatrix)))

for(i in 1:dim(mymatrix)[[1]]){
  r = mymatrix[i,]
  s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
  tfmat[i,] <- s
}

all.equal(s.vec, tfmat)

So the only piece missing is the row normalization in your final calculation, which rowSums handles:

tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))

for(i in 1:dim(mymatrix)[[1]]){
  r = mymatrix[i,]
  s = sapply(r, function(x) ifelse(x==0, 0, (1+log(x))*log((1+ndocs)/(1+x)) ) )
  tfmat[i,] = s/sqrt(sum(s^2))
}

all.equal(tfmat, tfmat.vec)

Answered by Justin

I'd bet (a small amount) that ditching `ifelse` entirely, and replacing the `-Inf` values by subsetting with `is.finite` will be even faster. – joran Mar 05 '12 at 19:03
@joran I keep hearing that, but haven't tested it myself. Good point though. Letting the logs return -Inf and changing them after might be the way to go. – Justin Mar 05 '12 at 19:07
@justin: Great suggestion. Thanks. – user721975 Mar 05 '12 at 19:55
Actually when I run this (on my sparse matrix object), I get the error `Error in storage.mode(test) <- "logical" : no method for coercing this S4 class to a vector` at the point where I try to use ifelse. – user721975 Mar 05 '12 at 20:39
@user721975 You should probably ask that as a separate question with a [small reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that replicates your error. Since we don't have your data, I made some up and it works great. – Justin Mar 05 '12 at 20:43
@justin : I think I probably will. BTW, can you explain how `tfmat.vec <- s.vec/sqrt(rowSums(s.vec^2))` is working? I just don't get how each element gets divided by the row sum of the row it is in. – user721975 Mar 05 '12 at 21:19
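On the question of how dividing by rowSums works: R stores a matrix column by column and recycles a shorter vector along that same order, so a vector with one value per row lines up with the row index of every element. A minimal sketch on a made-up 2 x 3 matrix (the names m and rs are illustrative):

```r
# Toy 2 x 3 matrix; its rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(as.numeric(1:6), nrow = 2)
rs <- rowSums(m)   # one value per row: 9 and 12

# Elements are stored column by column: 1 2 3 4 5 6
as.vector(m)

# rs is recycled in that same order: 9 12 9 12 9 12,
# so every element in row i is paired with rs[i]
rep(rs, length.out = length(m))

# Hence m / rs divides each element by the sum of its own row
m / rs

# sweep() spells out the same row-wise division explicitly
all.equal(m / rs, sweep(m, 1, rs, "/"))
```

This only works directly for per-row vectors; dividing by per-column values needs sweep() with MARGIN = 2 or a transpose.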
