
I have a double loop that I not only dislike, but that would take about 14 days to run on my computer, since it iterates over 3200 records and 1090 variables at roughly 0.12 seconds per iteration.
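(A quick sanity check on that estimate: 3200 × 3200 ≈ 10.2 million pairs of records, at roughly 0.12 s each, is about 1.2 million seconds, i.e. around 14 days.)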

Here is a smaller reproducible example. It simply counts, for each pair of records, how many columns hold the same value in both, not including NAs. The results are then attached to the original data frame.

y <- data.frame(c(1, 2, 1, NA, NA),
                c(3, 3, 3, 4, NA),
                c(5, 4, 5, 7, 7),
                c(7, 8, 7, 9, 10))
resultdf <- NULL
for (i in 1:nrow(y)) {
  results <- NULL
  for (j in 1:nrow(y)) {
    # count columns where records i and j hold the same value, ignoring NAs
    results <- c(results, sum(y[i, ] == y[j, ], na.rm = TRUE))
  }
  resultdf <- cbind(resultdf, results)
}
y <- cbind(y, resultdf)

There are repeated calculations (the comparison is symmetric, so each pair is computed twice) that could be avoided, but that would still leave about 7 days.

If I understand correctly, some of the apply functions are implemented in C and might be faster, but I haven't been able to get any of them to work. I'm also curious whether there is a package that would run faster. Can anyone help speed up this calculation?

Thank you!

ARobertson
  • you should start by seeing how much speed you get simply by converting `y` to a matrix before you start ... I think there may be something clever with rearranging the results of `outer(y,y,"==")` appropriately and taking row or column sums, but I don't have time to work it out right now ... – Ben Bolker Mar 05 '12 at 21:15
  • ... I assume by "repeat calculations" you're talking about looping over all (i,j) rather than just the lower or upper triangle ... – Ben Bolker Mar 05 '12 at 21:21
  • Changing to a matrix sped it up to about 16 minutes for the entire thing. Thanks for that hint! And yes, it's repeating calculations instead of calculating one of the triangles. How would you go about it? I'm guessing it's adding i <- i + 1 at the end of a loop to recalculate the lower boundary, but I've never done it. How do you copy it to the other triangle? – ARobertson Mar 05 '12 at 21:37

4 Answers


Here is another solution, using outer.

# count matching, non-NA values between records i and j
f <- function(i,j) sum(y[i,] == y[j,], na.rm=TRUE)
# evaluate f on every pair of row indices at once
d <- outer( 1:nrow(y), 1:nrow(y), Vectorize(f) )
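Applied to the 5-row example data frame from the question, this produces the symmetric matrix of match counts (each diagonal entry is that row's number of non-NA values, since NA == NA is NA):

> d
     [,1] [,2] [,3] [,4] [,5]
[1,]    4    1    4    0    0
[2,]    1    4    1    0    0
[3,]    4    1    4    0    0
[4,]    0    0    0    3    1
[5,]    0    0    0    1    2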
Vincent Zoonekynd

I created data to your specifications and, following @BenBolker's suggestion, used a matrix:

> y <- matrix(sample(c(1:9, NA), 3200 * 1090, replace = TRUE),
+             nrow = 3200, ncol = 1090)

and compared the computation times for three different implementations:

f1 was suggested by @Andrei:

> f1 <- function(y)apply(y, 1, function(r1)
+                  apply(y, 1, function(r2)sum(r1==r2, na.rm=TRUE)))

> system.time(r1 <- f1(y))
   user  system elapsed 
 523.51    0.77  528.73 

f2 was suggested by @VincentZoonekynd:

> f2 <- function(y) {
+   f <- function(i,j) sum(y[i,] == y[j,], na.rm=TRUE)
+   d <- outer( 1:nrow(y), 1:nrow(y), Vectorize(f) )
+   return(d)
+ }
> system.time(r2 <- f2(y))
   user  system elapsed 
 658.94    1.96  710.67

f3 is a double loop over the upper triangle only, as suggested by @BenBolker. It is also a bit more efficient than the code in your question in that it pre-allocates the output matrix:

> f3 <- function(y) {
+   result <- matrix(NA, nrow(y), nrow(y))
+   for (i in 1:nrow(y)) {
+     row1 <- y[i, ]
+     for (j in i:nrow(y)) {   # upper triangle only
+       row2 <- y[j, ]
+       num.matches  <- sum(row1 == row2, na.rm = TRUE)
+       result[i, j] <- num.matches
+       result[j, i] <- num.matches   # mirror into the lower triangle
+     }
+   }
+   return(result)
+ }

> system.time(r3 <- f3(y))
   user  system elapsed 
 167.66    0.08  168.72 

So the double loop is the fastest of all three, although not as elegant and compact as the other two answers.

flodel
  • An interesting benchmark. Usually apply functions work much faster than loops, e.g. as in sapply(vector, fun), but apparently not in this case. – Andrei Mar 06 '12 at 07:03
  • @Andrei, that's generally not true, see http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar. – flodel Mar 06 '12 at 12:14
  • thanks for the link. I believed that apply functions are faster, but now that I start thinking about my belief, I can not find any reasons for it :) – Andrei Mar 07 '12 at 09:21

Indeed, you can use an apply function. Given the earlier hint that a matrix works faster, I would try:

ym <- as.matrix(y)
resultdf <- apply(ym, 1, function(r1)
              apply(ym, 1, function(r2) sum(r1 == r2, na.rm = TRUE)))
Andrei

You can get rid of the inner loop (using the y and f3 from @flodel's answer):

ty <- t(y)                            # transpose once: each column of ty is one record
ix <- rep(1:nrow(y), each = ncol(y))  # grouping index: one group per record
f4 <- function(y){
    result <- matrix(0L, nrow(y), nrow(y))
    for(r in 1:nrow(y))
        # compare record r to all records at once, then sum the matches per group
        result[r,] <- rowsum(as.numeric(ty == y[r,]), ix, na.rm = TRUE)
    result
}

> system.time(out <- f4(y))
   user  system elapsed 
 52.616  21.061  74.000 
> system.time(out <- f3(y))
   user  system elapsed 
244.751   0.136 244.954 

It actually does extra work, computing each pair twice, but it is still 5 times faster. You can make it yet another 4 times faster by using the inner workings of rowsum. See this question for an example.
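For reference, here is a minimal sketch of the colSums variant that flodel suggests in the comments below (the name f5 is introduced here just for illustration). It keeps f4's one-loop structure but skips rowsum's grouping step:

f5 <- function(y) {
    ty <- t(y)   # transpose once so that each column of ty is one record
    result <- matrix(0L, nrow(y), nrow(y))
    for (r in 1:nrow(y))
        # compare record r against all records at once, then count TRUEs per column
        result[r, ] <- colSums(ty == y[r, ], na.rm = TRUE)
    result
}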

VitoshKa
  • I've tested but I don't get the same computation times as you do. On my machine, `f4` is 5 times *slower*. Can you confirm or are you using some non-standard library like ATLAS/MKL/etc? I do get a decent speed increase (`f4` ~40% faster than `f3`) if I use `result[r,] <- colSums(ty == y[r, ], na.rm = TRUE)`. – flodel Mar 07 '12 at 04:29
  • @flodel, everything is standard, gnu linux R13.1 here. have you tested it with large matrix as in your example? For small matrixes indeed f4 is slower; `rowsum` is written for speed and is better used with a lot of groups and many columns. – VitoshKa Mar 07 '12 at 11:14
  • I have tested with a 3200-by-1090 matrix as mentioned in the OP. – flodel Mar 07 '12 at 12:11