16

I'd like to compute the variance for each row in a matrix. For the following matrix A

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    5    6   10
[3,]   50    7   11
[4,]    4    8   12

I would like to get

[1]  16.0000   7.0000 564.3333  16.0000

I know I can achieve this with apply(A,1,var), but is there a faster or better way? From octave, I can do this with var(A,0,2), but I don't get how the Y argument of the var() function in R is to be used.

Edit: The actual dataset of a typical chunk has around 100 rows and 500 columns. The total amount of data is around 50GB though.

Toby
  • 2,174
  • 4
  • 22
  • 32

2 Answers2

29

You could potentially vectorize var over rows (or columns) using rowSums and rowMeans

RowVar <- function(x, ...) {
  rowSums((x - rowMeans(x, ...))^2, ...)/(dim(x)[2] - 1)
}

RowVar(A)
#[1]  16.0000   7.0000 564.3333  16.0000

Using @Richards data, yields in

microbenchmark(apply(m, 1, var), RowVar(m))

## Unit: milliseconds
## expr        min         lq     median         uq        max neval
## apply(m, 1, var) 343.369091 400.924652 424.991017 478.097573 746.483601   100
##        RowVar(m)   1.766668   1.916543   2.010471   2.412872   4.834471   100

You can also create a more general function that will receive a syntax similar to apply but will remain vectorized (the column wise variance will be slower as the matrix needs to be transposed first)

MatVar <- function(x, dim = 1, ...) {
  if(dim == 1){
     rowSums((x - rowMeans(x, ...))^2, ...)/(dim(x)[2] - 1)
  } else if (dim == 2) {
     rowSums((t(x) - colMeans(x, ...))^2, ...)/(dim(x)[1] - 1)
  } else stop("Please enter valid dimension")
}


MatVar(A, 1)
## [1]  16.0000   7.0000 564.3333  16.0000

MatVar(A, 2)
        V1         V2         V3 
## 547.333333   1.666667   1.666667 
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 3
    Yeah that speeds it up a bit (a lot). Where did the last 298 milliseconds of my life go? – Rich Scriven Aug 02 '14 at 22:45
  • what if you want to apply it only to a subset of columns? – skan Jul 22 '15 at 10:44
  • 2
    @skan so subset the within the matrix, for example, `RowVar(A[, 1:3)` will run it only on the first 3 columns. Also, take a look at the `matrixStats` package, they covered pretty much everything there. You can also download these functions directly from GH, see [here](https://gist.github.com/DavidArenburg/cca995ec1709c0d4735d) – David Arenburg Jul 22 '15 at 10:45
  • Amazing. I was able to cut my time in half. Thanks a lot! – AHegde May 23 '16 at 19:28
13

This is one of the main reasons why apply() is useful. It is meant to operate on the margins of an array or matrix.

set.seed(100)
m <- matrix(sample(1e5L), 1e4L)
library(microbenchmark)
microbenchmark(apply(m, 1, var))
# Unit: milliseconds
#              expr      min       lq   median       uq      max neval
#  apply(m, 1, var) 270.3746 283.9009 292.2933 298.1297 343.9531   100 

Is 300 milliseconds too long to make 10,000 calculations?

Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
  • We'll see if it's too much when I tackle the whole 50GB, but knowing that apply() is the way to go in this case already helps. Thanks! – Toby Aug 03 '14 at 11:11