
I've always taken it as fact that colMeans() or colSums() are the fastest way to perform their respective operations. As a ground rule, I am talking about base R only, not dplyr or data.table implementations.

While teaching some new users, I ran the benchmark myself to prove the point, and I am now consistently seeing results that contradict it.

library(microbenchmark)

n <- 10000
p <- 100

test_matrix <- matrix(runif(n * p), n, p)
test_df <- as.data.frame(test_matrix)

benchmark <- microbenchmark(
  colMeans(test_df),
  colMeans(as.matrix(test_df)),
  sapply(test_df, mean),
  vapply(test_df, mean, 0),
  colMeans(test_matrix),
  apply(test_matrix, 2, mean)
)

Unit: microseconds
                         expr      min        lq      mean    median        uq       max neval
            colMeans(test_df) 3099.941 3165.8290  3733.024  3241.345  3617.039 11387.090   100
 colMeans(as.matrix(test_df)) 3091.634 3158.0880  3553.537  3241.345  3548.507  8531.067   100
        sapply(test_df, mean) 2209.227 2267.3750  2723.176  2338.172  2602.289 10384.612   100
     vapply(test_df, mean, 0) 2180.153 2228.2945  2611.982  2270.584  2514.123  7421.356   100
        colMeans(test_matrix)  904.307  915.0685  1020.085   939.422  1002.667  2985.911   100
  apply(test_matrix, 2, mean) 9748.388 9957.0020 12098.328 10330.429 12582.889 34873.009   100

For a matrix, colMeans() torches apply(), which is expected. But for a data frame, sapply() and vapply() routinely beat colMeans(), even as I increase n and p. Is there a reason why I would want to use colMeans() on a data frame? It appears that the difference comes from the overhead of converting the data frame back into a matrix.
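To check whether the conversion alone accounts for the gap, something like the following should isolate it (a rough sketch using the objects defined above; exact timings will of course vary by machine):

# Isolate the cost of the data.frame -> matrix conversion itself
microbenchmark(
  as.matrix(test_df),           # conversion only
  colMeans(test_matrix),        # means on an existing matrix
  colMeans(as.matrix(test_df))  # conversion + means
)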

[plot: microbenchmark of means]

Main Question

In other words, is there a reason why (a more formal version of) the following would be inadvisable? Benchmarks show essentially no drop-off. Obviously this makes an assumption about the input the user passes in, but that is not the point here.

colMeans2 <- function(myobject) {
  if (typeof(myobject) == "double") {
    # a numeric matrix is stored as a double vector
    colMeans(myobject)
  } else if (typeof(myobject) == "list") {
    # a data frame is stored as a list of columns
    vapply(myobject, mean, 0)
  } else {
    stop("what is this")
  }
}
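For what it's worth, a quick sanity check (again using the objects defined above) that the two branches agree with plain colMeans():

# Same results either way, just dispatched differently;
# all.equal() rather than identical() because colMeans() and mean()
# may differ in the last few bits
all.equal(colMeans2(test_matrix), colMeans(test_matrix))
all.equal(colMeans2(test_df), colMeans(test_df))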

For Reference

Here are two other posts I could find, both somewhat related, and both of which mention that colMeans() should be faster.

Grouping functions (tapply, by, aggregate) and the *apply family

Why are `colMeans()` and `rowMeans()` functions faster than using the mean function with `lapply()`?

  • yes, you are right, overhead of transforming the data.frame to matrix each time. – minem Mar 05 '19 at 16:35
  • For data frames you can optimize the performance with `vapply(test_df, sum, 0) / n`. – Sven Hohenstein Mar 05 '19 at 16:40
  • You should probably use `is.array()` and `is.list()` instead of `typeof()`. But if you're making assumptions about input, then what's to critique? – Nathan Werth Mar 05 '19 at 16:53
  • So that is a great point Sven. That is another huge speedup that actually brings it close in speed to the `colMeans` on the matrix. So I guess I am failing to see the advantages of using these functions on a data frame. –  Mar 05 '19 at 16:55
  • From my experience, the traditional examples are with `matrices`, which is more natural for numeric data. Then the comparison is with `apply`, which still hasn't been optimized for performance. – RolandASc Mar 05 '19 at 17:15
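To illustrate Sven Hohenstein's suggestion from the comments, this is roughly the comparison I ran afterwards (a sketch only; timings will vary):

# Suggested in the comments: compute per-column sums, then divide once
microbenchmark(
  vapply(test_df, mean, 0),
  vapply(test_df, sum, 0) / n,
  colMeans(test_matrix)
)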

0 Answers