I need to calculate several statistical parameters for a vector, while omitting each value within it once. Since this happens on a large dataset with many parameters, I'm looking for a general approach to optimize for performance. A simple example would be:

v <- c(9, 14, 8, 12, 5, 10, 6, 9, 9, 9, 9, 10, 8, 11, 9, 9, 10, 6, 10, 10)

sapply(1:length(v), function(x){
    var(v[-x])
})

This returns a vector containing the variance of v with each element omitted in turn:

 [1] 4.362573 2.988304 4.286550 3.888889 3.356725 4.321637 3.783626 4.362573 4.362573 4.362573 4.362573 4.321637 4.286550 4.163743 4.362573 4.362573
[17] 4.321637 3.783626 4.321637 4.321637

As stated, this performs poorly on larger datasets with multiple parameters. Since loops are often criticized for being slow in R, I'm looking for efficient alternatives, such as vectorized functions.

Thank you!

EDIT: Both proposed solutions boost performance significantly. Dominik's solution wins on speed, while Roland's approach is more general and can be applied to other statistics. Therefore, Roland's answer is marked as correct, and I will use Dominik's solution for this particular situation. Thanks to both!

Results with N = 2000
    Unit: milliseconds
                      expr      min        lq       mean    median        uq      max neval
         original approach 117.2269 122.38290 130.933014 124.95565 128.69030 453.0770   100
      approach from Roland  57.1625  64.75505  96.255364  67.88550 168.55915 204.6941   100
     approach from Dominik   2.7083   2.89440   3.395894   2.99545   3.24165  30.0510   100
Comfort Eagle

2 Answers

We can use the formula for the sample variance: sum((v-m)^2)/(n-1)

where

n <- length(v)
m <- mean(v)

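Let i be any index from 1 to n and let x = v[i]. Removing x leaves n-1 values whose sum is sum(v)-x and whose sum of squares is sum(v^2)-x^2. Applying the shortcut form of the variance, (sum of squares - sum^2/count)/(count-1), to those n-1 values gives:

# x = v[i]
# var(v[-i]) is equal to (sum(v^2) - x^2 - (sum(v) - x)^2/(n-1))/(n-2)

Expanding the brackets and collecting the powers of x turns this into a quadratic in x whose coefficients depend only on the whole vector. Here is your code: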

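# Coefficients depend only on the full vector, so they are computed once
a <- sum(v^2)/(n-2) - sum(v)^2/(n-1)/(n-2)
b <- n/(n-1)/(n-2)
d <- 2*sum(v)/((n-1)*(n-2))

# Evaluate the quadratic for every element x; each result equals var(v[-i])
apply(X = as.matrix(v), MARGIN = 1, FUN = function(x){
  a - b*x^2 + d*x
})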
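
Since the expression is plain arithmetic on x, it could also be evaluated on the whole vector at once, without `apply`. The following is only a minimal sketch, assuming `v` is a numeric vector with at least three elements; the function name `loo_var` is hypothetical and not part of the original answer. It checks the closed form against the naive loop from the question.

```r
# Hypothetical helper (name is an assumption): leave-one-out variance,
# computed for all elements at once from the sum and the sum of squares.
loo_var <- function(v) {
  n <- length(v)
  stopifnot(is.numeric(v), n >= 3)  # n - 2 must be positive
  s  <- sum(v)     # sum of all values
  ss <- sum(v^2)   # sum of all squared values
  # For each element: (remaining sum of squares - remaining sum^2 / (n-1)) / (n-2)
  (ss - v^2 - (s - v)^2 / (n - 1)) / (n - 2)
}

v <- c(9, 14, 8, 12, 5, 10, 6, 9, 9, 9, 9, 10, 8, 11, 9, 9, 10, 6, 10, 10)

# Reference result from the loop in the question
naive <- sapply(seq_along(v), function(i) var(v[-i]))

# Stops with an error if the two results differ beyond numerical tolerance
stopifnot(isTRUE(all.equal(loo_var(v), naive)))
```

One caveat with this shortcut form: it subtracts two large sums, so it can lose precision when the values are large relative to their spread. Centering first, for example `loo_var(v - mean(v))`, leaves the variances unchanged and should reduce that risk.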
Dominik Rafacz
  • Thanks Dominik, that really boosts performance a lot! I'm going to use this, while marking Roland's post as correct, because it is closer to a general approach! – Comfort Eagle Feb 13 '19 at 16:30
  • Leave-one-out variance formula `(sum(v^2) - v^2 - (sum(v) - v)^2 / (n-1)) / (n-2)` – Khashaa Mar 01 '19 at 04:05

You could use combn to create a matrix of all leave-one-out combinations and then use a fast implementation of column-wise variance calculation. This should be efficient as long as the vector is not huge and you have sufficient memory, since the matrix has (n - 1) × n entries.

library(microbenchmark)

library(matrixStats)

microbenchmark(loop = {
  res1 <- sapply(1:length(v), function(x){
    var(v[-x])
  })
},
combn = {res2 <- colVars(combn(v, length(v) - 1))}
)

#Unit: microseconds
# expr     min       lq     mean  median       uq      max neval cld
# loop 633.528 646.0755 736.6643 654.526 675.9085 5652.840   100   b
#combn  58.641  62.4820  67.7778  66.067  69.1400  173.106   100  a 


all.equal(sort(res1), sort(res2))
#[1] TRUE
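
`combn` also accepts a function through its `FUN` argument, so the same idea can be sketched for an arbitrary statistic in base R, without `matrixStats`. The code below is a rough illustration rather than a tuned implementation; it assumes `v` from the question, and `loo_stat` is a hypothetical name. For subsets of size `length(v) - 1`, `combn` produces the subsets in reverse order of the omitted element (the first subset drops the last value), which is why the comparison above sorts both results; here `rev()` restores the original order instead.

```r
# Hypothetical helper (name is an assumption): apply `stat` to every
# leave-one-out subset of `v`, returned in the order of the omitted element.
loo_stat <- function(v, stat) {
  stopifnot(length(v) >= 2)
  # combn() calls `stat` on each subset of size n - 1.
  # The first subset drops v[n] and the last drops v[1], hence rev().
  rev(as.vector(combn(v, length(v) - 1, FUN = stat)))
}

v <- c(9, 14, 8, 12, 5, 10, 6, 9, 9, 9, 9, 10, 8, 11, 9, 9, 10, 6, 10, 10)

loo_variance <- loo_stat(v, var)     # same values as the loop in the question
loo_median   <- loo_stat(v, median)  # any other statistic works the same way

# Compare against the loop without sorting; stops with an error on mismatch
reference <- sapply(seq_along(v), function(i) var(v[-i]))
stopifnot(isTRUE(all.equal(loo_variance, reference)))
```

This version calls `stat` once per subset, so it is likely slower than `colVars` for the variance; its appeal is only that it is generic.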
Roland