
Speeding up loops in R can easily be done using a function from the apply family. How can I use an apply function in the code below to speed it up? Note that within the loop, at each iteration, one column is permuted and a function is applied to the new data frame (i.e., the initial data frame with one column permuted). I cannot seem to get apply to work because the new data frame has to be built within the loop.

#x <- data.frame(a=1:10,b=11:20,c=21:30) #small example
x <- data.frame(matrix(runif(50*100),nrow=50,ncol=100)) #larger example
y <- rowMeans(x)

start <- Sys.time()  

totaldiff <- numeric()

for (i in 1:ncol(x)){
    x.after <- x

    x.after[,i] <- sample(x[,i])

    diff <- abs(y-rowMeans(x.after))

    totaldiff[i] <- sum(diff)

}

colnames(x)[which.max(totaldiff)]

Sys.time() - start
user1134616
  • You want someone to review your working code and optimise it? This is more suited to [**Code Review**](http://codereview.stackexchange.com/). – Simon O'Hanlon May 10 '13 at 16:28
  • The premise of the question is incorrect. For-loops cannot be made more efficient with the apply family. If the loop is inefficient it is because the body needs work. There are several SO questions on optimization with advice about preallocating vectors, using vectorized functions, and other standard approaches. I think this one should be closed as a duplicate. – IRTFM May 10 '13 at 16:51
  • here is a version that is faster than everything posted yet: `function(x) { x <- as.matrix(x); totaldiff <- colSums(abs((apply(x, 2, sample) - x) / ncol(x))); colnames(x)[which.max(totaldiff)] }` – flodel May 10 '13 at 17:50
  • http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r/8474941#8474941 – Ari B. Friedman May 10 '13 at 19:39
  • first sentence is nonsense – mdsumner May 11 '13 at 00:08

3 Answers


After working through this and the other replies, the optimization strategies (and approximate speed-ups) here seem to be

  • (30x) Choose an appropriate data representation -- matrix, rather than data.frame
  • (1.5x) Reduce unnecessary data copies -- difference of columns, rather than of rowMeans
  • Structure for loops as *apply functions (to emphasize code structure, simplify memory management, and provide type consistency)
  • (2x) Hoist vector operations outside loops -- abs and sum on columns become abs and colSums on a matrix

for an overall speed-up of about 100x. For this size and complexity of code, the use of the compiler or parallel packages would not be effective.

I put your code into a function

f0 <- function(x) {
    y <- rowMeans(x)
    totaldiff <- numeric()
    for (i in 1:ncol(x)){
        x.after <- x
        x.after[,i] <- sample(x[,i])
        diff <- abs(y-rowMeans(x.after))
        totaldiff[i] <- sum(diff)
    }
    which.max(totaldiff)
}

and here is the baseline timing

x <- data.frame(matrix(runif(50*100),nrow=50,ncol=100)) #larger example
set.seed(123)
system.time(res0 <- f0(x))
##   user  system elapsed 
##  1.065   0.000   1.066 

Your data can be represented as a matrix, and operations on R matrices are faster than on data.frames.

m <- matrix(runif(50*100),nrow=50,ncol=100)
set.seed(123)
system.time(res0.m <- f0(m))
##   user  system elapsed 
##  0.036   0.000   0.037 
identical(res0, res0.m)
## [1] TRUE

That's probably the biggest speed-up. But for the specific operation here we don't need to recompute the row means of the updated matrix at all: permuting column i changes each row mean by exactly (shuffled value - original value) / ncol(x), so we can work with that change directly

f1 <- function(x) {
    totaldiff <- numeric()
    for (i in 1:ncol(x)){
        diff <- abs(sample(x[,i]) - x[,i]) / ncol(x)
        totaldiff[i] <- sum(diff)
    }
    which.max(totaldiff)
}

The for loop doesn't follow the right pattern for filling the result vector totaldiff: you want to "pre-allocate and fill", i.e., start with totaldiff <- numeric(ncol(x)) (a sketch follows below). Alternatively, we can use sapply and let R worry about the memory management; that is one of the advantages of the apply family of functions.
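For reference, a minimal pre-allocate-and-fill version of the loop might look like this (the name f1b is my own, not from the original answer):

f1b <- function(x) {
    n <- ncol(x)
    totaldiff <- numeric(n)  # pre-allocated to its final length
    for (i in 1:n) {
        # total change in row means from permuting column i
        totaldiff[i] <- sum(abs(sample(x[,i]) - x[,i]) / n)
    }
    which.max(totaldiff)
}

The sapply version is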

f2 <- function(x) {
    totaldiff <- sapply(seq_len(ncol(x)), function(i, x) {
        sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
    }, x)
    which.max(totaldiff)
}
set.seed(123); identical(res0, f1(m))
set.seed(123); identical(res0, f2(m))

The timings are

> library(microbenchmark)
> microbenchmark(f0(m), f1(m), f2(m))
Unit: milliseconds
  expr      min       lq   median       uq      max neval
 f0(m) 32.45073 33.07804 33.16851 33.26364 33.81924   100
 f1(m) 22.20913 23.87784 23.96915 24.06216 24.66042   100
 f2(m) 21.02474 22.60745 22.70042 22.80080 23.19030   100

@flodel points out that vapply can be faster (and provides type safety)

f3 <- function(x) {
    totaldiff <- vapply(seq_len(ncol(x)), function(i, x) {
        sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
    }, numeric(1), x)
    which.max(totaldiff)
}

and that

f4 <- function(x)
    which.max(colSums(abs(apply(x, 2, sample) - x)))

is still faster (ncol(x) is a constant factor, so it has been dropped); the abs and sum are hoisted outside the sapply, perhaps at the expense of additional memory use. The advice in the comments to compile functions is good in general; here are some further timings

> microbenchmark(f0(m), f1(m), f1.c(m), f2(m), f2.c(m), f3(m), f4(m))
Unit: milliseconds
    expr      min       lq   median       uq       max neval
   f0(m) 32.35600 32.88326 33.12274 33.25946  34.49003   100
   f1(m) 22.21964 23.41500 23.96087 24.06587  24.49663   100
 f1.c(m) 20.69856 21.20862 22.20771 22.32653 213.26667   100
   f2(m) 20.76128 21.52786 22.66352 22.79101  69.49891   100
 f2.c(m) 21.16423 21.57205 22.94157 23.06497  23.35764   100
   f3(m) 20.17755 21.41369 21.99292 22.10814  22.36987   100
   f4(m) 10.10816 10.47535 10.56790 10.61938  10.83338   100

where the ".c" versions are byte-compiled. Compilation is particularly helpful for code written with for loops but doesn't do much for vectorized code; that shows up here as a small but consistent improvement from compiling f1's for loop, but not f2's sapply.
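The answer doesn't show how the compiled versions were built; presumably something like the following, using cmpfun from the base compiler package (a sketch; the names f1.c and f2.c match the timings above):

library(compiler)
f1.c <- cmpfun(f1)  # byte-compile the for-loop version
f2.c <- cmpfun(f2)  # byte-compile the sapply version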

Martin Morgan

Since you are looking at efficiency/optimization, start by using the rbenchmark package for comparison purposes.

Rewriting your given example as a function (so that it can be replicated and compared)

forFirst <- function(x) {
    y <- rowMeans(x)
    totaldiff <- numeric()
    for (i in 1:ncol(x)){
        x.after <- x
        x.after[,i] <- sample(x[,i])
        diff <- abs(y-rowMeans(x.after))
        totaldiff[i] <- sum(diff)
    }
    colnames(x)[which.max(totaldiff)]
}

Applying some standard optimizations (pre-allocating totaldiff to the right size, eliminating intermediate variables that are only used once) gives

forSecond <- function(x) {
    y <- rowMeans(x)
    totaldiff <- numeric(ncol(x))
    for (i in 1:ncol(x)){
        x.after <- x
        x.after[,i] <- sample(x[,i])
        totaldiff[i] <- sum(abs(y-rowMeans(x.after)))
    }
    colnames(x)[which.max(totaldiff)]
}

There is not much more that I can see to improve in the loop itself; the biggest help would be a better algorithm. But since this particular problem is just an example, it is not worth spending that time.

The apply version looks very similar.

applyFirst <- function(x) {
    y <- rowMeans(x)
    totaldiff <- sapply(seq_len(ncol(x)), function(i) {
        x[,i] <- sample(x[,i])  # modifies a local copy of x
        sum(abs(y - rowMeans(x)))
    })
    colnames(x)[which.max(totaldiff)]
}

Benchmarking them gives:

> library("rbenchmark")
> benchmark(forFirst(x),
+           forSecond(x),
+           applyFirst(x),
+           order = "relative")
           test replications elapsed relative user.self sys.self user.child
1   forFirst(x)          100   16.92    1.000     16.88     0.00         NA
2  forSecond(x)          100   17.02    1.006     16.96     0.03         NA
3 applyFirst(x)          100   17.05    1.008     17.02     0.01         NA
  sys.child
1        NA
2        NA
3        NA

The differences between these are just noise. In fact, running the benchmark again gives a different ordering:

> benchmark(forFirst(x),
+           forSecond(x),
+           applyFirst(x),
+           order = "relative")
           test replications elapsed relative user.self sys.self user.child
3 applyFirst(x)          100   17.05    1.000     17.02        0         NA
2  forSecond(x)          100   17.08    1.002     17.05        0         NA
1   forFirst(x)          100   17.44    1.023     17.41        0         NA
  sys.child
3        NA
2        NA
1        NA

So these approaches run at essentially the same speed. Any real improvement will come from using a better algorithm than simple looping and copying to create the intermediate results.

flodel
Brian Diggs

Apply functions do not necessarily speed up loops in R; sometimes they can even slow them down. There's no reason to believe that turning this into an apply family function will speed it up by any appreciable amount.

As an aside, this code seems like a relatively pointless endeavour. It's just going to select a random column. I could get the same result by just doing that in the first place. Perhaps this is nested in a larger loop looking for a distribution?
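That intuition can be made precise: because the columns are i.i.d., every column is equally likely to be the one whose permutation moves the row means most. A quick simulation sketch (mine, not from the answer) illustrates this by regenerating the data for each replicate:

set.seed(1)
picks <- replicate(1000, {
    m <- matrix(runif(50 * 100), nrow = 50, ncol = 100)
    which.max(colSums(abs(apply(m, 2, sample) - m)))
})
# by column exchangeability, picks should be roughly uniform on 1:100
range(table(factor(picks, levels = 1:100)))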

John