I've always taken it as fact that `colMeans()` or `colSums()` are the fastest way to perform their respective operations. As a ground rule, I am talking about base R, not the dplyr or data.table implementations.
While teaching some new users, I ran the benchmark myself to prove the point, and I am now consistently seeing the opposite conclusion.
```r
library(microbenchmark)

n <- 10000
p <- 100
test_matrix <- matrix(runif(n * p), n, p)
test_df <- as.data.frame(test_matrix)

benchmark <- microbenchmark(
  colMeans(test_df),
  colMeans(as.matrix(test_df)),
  sapply(test_df, mean),
  vapply(test_df, mean, 0),
  colMeans(test_matrix),
  apply(test_matrix, 2, mean)
)
```
```
Unit: microseconds
                         expr      min        lq      mean    median        uq       max neval
            colMeans(test_df) 3099.941 3165.8290  3733.024  3241.345  3617.039 11387.090   100
 colMeans(as.matrix(test_df)) 3091.634 3158.0880  3553.537  3241.345  3548.507  8531.067   100
        sapply(test_df, mean) 2209.227 2267.3750  2723.176  2338.172  2602.289 10384.612   100
     vapply(test_df, mean, 0) 2180.153 2228.2945  2611.982  2270.584  2514.123  7421.356   100
        colMeans(test_matrix)  904.307  915.0685  1020.085   939.422  1002.667  2985.911   100
  apply(test_matrix, 2, mean) 9748.388 9957.0020 12098.328 10330.429 12582.889 34873.009   100
```
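For what it's worth, the two data-frame approaches agree numerically. A quick sanity check (reusing the same `test_df`; note that `mean()` applies a refinement pass that `colMeans()` does not, so agreement is up to floating-point tolerance rather than necessarily bit-for-bit):

```r
# Should return TRUE: same column means, allowing for tiny
# floating-point differences between the two summation strategies.
all.equal(colMeans(test_df), vapply(test_df, mean, 0))
```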
For a matrix, `colMeans()` torches `apply()`. That is expected. But for a data frame, `sapply()` and `vapply()` routinely beat `colMeans()`, even as I increase `n` and `p`. Is there a reason why I would want to use `colMeans()` on a data frame? It appears that the difference comes from the overhead of converting the data frame back into a matrix.
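To check that hypothesis, a quick follow-up benchmark (my own sketch, reusing the objects defined above) isolates the coercion cost:

```r
# If the conversion hypothesis is right, as.matrix() alone should
# account for most of the gap between the data-frame and matrix calls.
microbenchmark(
  as.matrix(test_df),      # coercion only
  colMeans(test_matrix),   # computation only
  colMeans(test_df)        # coercion + computation
)
```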
Main Question
In other words, is there a reason why (a more formal version of) the following would be inadvisable? Benchmarks show basically no drop-off. Obviously this makes an assumption about the input the user passes in, but that is not the point here.
```r
colMeans2 <- function(myobject) {
  if (typeof(myobject) == "double") {
    # numeric matrix: use the fast internal path directly
    colMeans(myobject)
  } else if (typeof(myobject) == "list") {
    # data frame: take means column by column, skipping as.matrix()
    vapply(myobject, mean, 0)
  } else {
    stop("what is this")
  }
}
```
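For completeness, here is a slightly more defensive version of the same idea (my own sketch; `colMeans3` and its checks are illustrative choices, not an established API):

```r
# Dispatch on class rather than typeof(), and pass na.rm through.
colMeans3 <- function(x, na.rm = FALSE) {
  if (is.matrix(x) && is.numeric(x)) {
    colMeans(x, na.rm = na.rm)
  } else if (is.data.frame(x)) {
    # vapply() forwards na.rm to mean() and enforces a numeric result
    vapply(x, mean, numeric(1), na.rm = na.rm)
  } else {
    stop("expected a numeric matrix or a data frame")
  }
}
```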
For Reference
Here are two other posts I could find, both somewhat related and mentioning how `colMeans()` should be faster.
Grouping functions (tapply, by, aggregate) and the *apply family
Why are `colMeans()` and `rowMeans()` functions faster than using the mean function with `lapply()`?