After working through this and the other replies, the optimization strategies (and their approximate speed-ups) here seem to be:
- (30x) Choose an appropriate data representation -- matrix, rather than data.frame
- (1.5x) Reduce unnecessary data copies -- difference of columns, rather than of rowMeans
- Structure for loops as *apply functions (to emphasize code structure, simplify memory management, and provide type consistency)
- (2x) Hoist vector operations outside loops -- abs and sum on columns become abs and colSums on a matrix
for an overall speed-up of about 100x. For this size and complexity of code, the use of the compiler or parallel packages would not be effective.
I put your code into a function
f0 <- function(x) {
    y <- rowMeans(x)
    totaldiff <- numeric()
    for (i in 1:ncol(x)) {
        x.after <- x
        x.after[,i] <- sample(x[,i])
        diff <- abs(y - rowMeans(x.after))
        totaldiff[i] <- sum(diff)
    }
    which.max(totaldiff)
}
and here we have
x <- data.frame(matrix(runif(50*100),nrow=50,ncol=100)) #larger example
set.seed(123)
system.time(res0 <- f0(x))
## user system elapsed
## 1.065 0.000 1.066
Your data can be represented as a matrix, and operations on R matrices are faster than on data.frames.
m <- as.matrix(x)  # the same data, as a matrix rather than a data.frame
set.seed(123)
system.time(res0.m <- f0(m))
## user system elapsed
## 0.036 0.000 0.037
identical(res0, res0.m)
## [1] TRUE
That's probably the biggest speed-up. But for the specific operation here we don't need to calculate the row means of the updated matrix, just the change in the mean from shuffling one column
f1 <- function(x) {
    y <- rowMeans(x)
    totaldiff <- numeric()
    for (i in 1:ncol(x)) {
        diff <- abs(sample(x[,i]) - x[,i]) / ncol(x)
        totaldiff[i] <- sum(diff)
    }
    which.max(totaldiff)
}
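To see why this works: only column i changes, so each row sum changes by the difference between the shuffled and original values, and each row mean by that difference divided by ncol(x). A quick check of this (not part of the original answer), shuffling the first column of m:

i <- 1
x.after <- m
x.after[,i] <- sample(m[,i])
## the change in the row means equals the per-row change in column i, scaled by 1/ncol(x)
all.equal(rowMeans(x.after) - rowMeans(m),
          (x.after[,i] - m[,i]) / ncol(m))

should report TRUE (up to floating-point tolerance).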
The for loop doesn't follow the right pattern for filling up the result vector totaldiff (you want to "pre-allocate and fill", so totaldiff <- numeric(ncol(x))), but we can use an sapply and let R worry about that (this memory management is one of the advantages of using the apply family of functions).
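For reference, a pre-allocated version of that loop would look something like the sketch below (not part of the original answer; the name f1a is made up):

f1a <- function(x) {
    ## pre-allocate the result vector, then fill it in place
    totaldiff <- numeric(ncol(x))
    for (i in seq_len(ncol(x)))
        totaldiff[i] <- sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
    which.max(totaldiff)
}

With sapply, R handles that bookkeeping for us: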
f2 <- function(x) {
    totaldiff <- sapply(seq_len(ncol(x)), function(i, x) {
        sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
    }, x)
    which.max(totaldiff)
}
set.seed(123); identical(res0, f1(m))
set.seed(123); identical(res0, f2(m))
The timings are
> library(microbenchmark)
> microbenchmark(f0(m), f1(m), f2(m))
Unit: milliseconds
expr min lq median uq max neval
f0(m) 32.45073 33.07804 33.16851 33.26364 33.81924 100
f1(m) 22.20913 23.87784 23.96915 24.06216 24.66042 100
f2(m) 21.02474 22.60745 22.70042 22.80080 23.19030 100
@flodel points out that vapply can be faster (and provides type safety)
f3 <- function(x) {
    totaldiff <- vapply(seq_len(ncol(x)), function(i, x) {
        sum(abs(sample(x[,i]) - x[,i]) / ncol(x))
    }, numeric(1), x)
    which.max(totaldiff)
}
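The type safety comes from the numeric(1) template: vapply checks that each result matches it, so a function that unexpectedly returns the wrong type or length fails immediately instead of silently producing, say, a character vector or list. A toy illustration (not from the original answer):

vapply(1:3, function(i) i / 2, numeric(1))            # fine: 0.5 1.0 1.5
try(vapply(1:3, function(i) letters[i], numeric(1)))  # error: values must be type 'double'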
@flodel also points out that

f4 <- function(x)
    which.max(colSums(abs(apply(x, 2, sample) - x)))

is still faster (ncol(x) is a constant factor, so it is removed) -- the abs and sum are hoisted outside the sapply, maybe at the expense of additional memory use. The advice in the comments to compile functions is good in general; here are some further timings
> microbenchmark(f0(m), f1(m), f1.c(m), f2(m), f2.c(m), f3(m), f4(m))
Unit: milliseconds
expr min lq median uq max neval
f0(m) 32.35600 32.88326 33.12274 33.25946 34.49003 100
f1(m) 22.21964 23.41500 23.96087 24.06587 24.49663 100
f1.c(m) 20.69856 21.20862 22.20771 22.32653 213.26667 100
f2(m) 20.76128 21.52786 22.66352 22.79101 69.49891 100
f2.c(m) 21.16423 21.57205 22.94157 23.06497 23.35764 100
f3(m) 20.17755 21.41369 21.99292 22.10814 22.36987 100
f4(m) 10.10816 10.47535 10.56790 10.61938 10.83338 100
where the ".c" are compiled versions. Compilation is particularly helpful in code written with for loops but doesn't do much for vectorized code; this is shown here, where there's a small but consistent improvement from compiling f1's for loop, but not f2's sapply.
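For completeness, the compiled ".c" versions can be created with the compiler package, along these lines (a sketch; the step isn't shown in the timings above):

library(compiler)
f1.c <- cmpfun(f1)   # byte-compile the for-loop version
f2.c <- cmpfun(f2)   # byte-compile the sapply version

(In recent versions of R the just-in-time compiler byte-compiles functions automatically, so the explicit cmpfun step matters less than it used to.)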