7

Sorry to ask this ... it's surely a FAQ, and it's kind of a silly question, but it's been bugging me. Suppose I want to get the variance of every numeric column in a dataframe, such as

df <- data.frame(x=1:5,y=seq(1,50,10))

Naturally, I try

var(df)

Instead of giving me what I'd hoped for, which would be something like

  x    y
2.5  250

I get this

     x   y
x  2.5  25
y 25.0 250

which has the variances on the diagonal and the covariances everywhere else. That makes sense once I look up help(var) and read that "var is just another interface to cov"; variance is, of course, the covariance of a variable with itself. The output is slightly confusing, but I can read along the diagonal, or get only the variances using diag(var(df)), sapply(df, var), or lapply(df, var), or by calling var separately on df$x and df$y.

But why? Variance is a routine, basic descriptive statistic, second only to mean. Shouldn't it be completely and totally trivial to apply it to columns of a dataframe? Why give me the covariances when I only asked for variances? Just curious. Thanks for any comments on this.

Mars
  • [This](http://stackoverflow.com/q/9424311/324364) question might also make for some good reading. – joran Mar 27 '13 at 03:57

3 Answers

10

The idiomatic approach is

sapply(df, var)
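
Since the question asks about "every numeric column", here is a minimal sketch of restricting the call to numeric columns first. The data frame `df_mixed` and its character column `g` are made up for illustration; they are not part of the original question.

```r
# Hypothetical data frame: the question's two columns plus a character column
df_mixed <- data.frame(x = 1:5, y = seq(1, 50, 10), g = letters[1:5])

# Flag the numeric columns; vapply() guarantees one logical value per column
numeric_cols <- vapply(df_mixed, is.numeric, logical(1))

# Apply var() column by column, to the numeric columns only
variances <- sapply(df_mixed[numeric_cols], var)
print(variances)
```

Without the filter, var() would stop with an error on the non-numeric column.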

var handles data.frames by coercing them to a matrix and then computing the covariances between the columns; this is not a method in the S3/S4 sense.

Variance is a routine, basic descriptive statistic, but so are covariances and correlations. They are all interlinked and interesting, especially if you are aiming to use a linear model.

You could always write your own function that behaves the way you want:

Var <- function(x, ...) {
  if (is.data.frame(x)) {
    sapply(x, var, ...)  # data frame: one variance per column
  } else {
    var(x, ...)          # anything else: defer to var()
  }
}
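
As a rough usage sketch of the wrapper above: the function is repeated in compact form so the snippet runs on its own, and the single-column data frame with an NA is an invented example to show that extra arguments are passed through.

```r
# Example data from the question
df <- data.frame(x = 1:5, y = seq(1, 50, 10))

# Same wrapper as above, repeated so this snippet is self-contained
Var <- function(x, ...) {
  if (is.data.frame(x)) sapply(x, var, ...) else var(x, ...)
}

per_column <- Var(df)    # data frame input: named vector of column variances
single     <- Var(df$x)  # plain vector input: falls through to var()

# Extra arguments in ... reach var(), e.g. na.rm (made-up data with an NA)
with_na <- Var(data.frame(a = c(1, 2, NA, 4)), na.rm = TRUE)

print(per_column)
```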
mnel
  • It may be a little confusing to say that `var` has a method for data frames; it doesn't in the usual R sense of the word method (an S3 or S4 method). data frames are simply converted to matrices and then `cov` applied. – Gavin Simpson Mar 27 '13 at 04:06
  • Thank you mnel, GavinSimpson, SimonO101. These are all very helpful answers, as is joran's link--so that although I've voted them all up, I'm unwilling to mark one as _the_ answer. I get it. – Mars Mar 27 '13 at 15:20
9

This is documented in ?var, namely:

Description:

     ‘var’, ‘cov’ and ‘cor’ compute the variance of ‘x’ and the
     covariance or correlation of ‘x’ and ‘y’ if these are vectors.  If
     ‘x’ and ‘y’ are matrices then the covariances (or correlations)
     between the columns of ‘x’ and the columns of ‘y’ are computed.

where by "matrices" the text means objects of class "matrix" and "data.frame".

var doesn't have a method for data frames in the conventional sense. var simply coerces the input data frame to a matrix via as.matrix and then calls cov on that matrix.
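
A small check of that description, using the data frame from the question. This is a sketch that compares results only; it does not inspect the internals of var.

```r
df <- data.frame(x = 1:5, y = seq(1, 50, 10))

via_var <- var(df)             # data frame passed directly
via_cov <- cov(as.matrix(df))  # explicit coercion, then cov()

# Both routes should give the same variance-covariance matrix
same <- isTRUE(all.equal(via_var, via_cov))

# The per-column variances sit on the diagonal
variances <- diag(via_var)
print(same)
print(variances)
```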

In response to the question of why: I guess it is because variance is closely related to covariance. To keep the code simple, R Core wrote a single implementation for the covariance of a matrix-like object and used it for the variance too, as the covariance matrix is the most likely thing you want from a matrix.

Or, more succinctly: that is how R Core implemented this. Learn to live with it. :-)

Also note that R is moving away from having functions like mean and sd operate on the components (columns) of a data frame. If you want to apply any of these functions, including var, you are required to call something like:

apply(foo, 2, mean) ## for matrices
sapply(foo, mean) ## for data frames

or faster specific alternatives

colMeans(foo)
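
To make the three calls concrete, here is a minimal sketch in which `foo` is simply assumed to be the question's example data frame. The as.matrix() call is added so that the apply() line really operates on a matrix.

```r
# Assumption: foo stands in for the question's example data
foo <- data.frame(x = 1:5, y = seq(1, 50, 10))

m_apply  <- apply(as.matrix(foo), 2, mean)  # matrix route: margin 2 = columns
m_sapply <- sapply(foo, mean)               # data frame route: one call per column
m_col    <- colMeans(foo)                   # dedicated column-means function

print(rbind(m_apply, m_sapply, m_col))
```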

In this instance, I originally suspected that diag(var(df)) would be the most efficient way to get the variances, rather than calling var repeatedly via one of the apply family of functions. However, diag(var(df)) is unlikely to be quicker than sapply(df, var), as the former has to compute all the covariances as well as the variances.
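
One way to check this on your own machine is a rough timing sketch like the following. The data size and the seed are arbitrary choices made here, and the timings will vary with hardware and R version, so no expected numbers are given.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 1e6     # arbitrary size; increase it to make the difference more visible
big <- data.frame(x = rnorm(n), y = rbinom(n, 100, 0.5))

# Time each approach and keep the results for comparison
t_sapply <- system.time(v_sapply <- sapply(big, var))
t_diag   <- system.time(v_diag <- diag(var(big)))

# Elapsed seconds for each approach
print(c(sapply = t_sapply[["elapsed"]], diag_var = t_diag[["elapsed"]]))

# Both approaches should agree on the variances themselves
print(all.equal(v_sapply, v_diag))
```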

Gavin Simpson
  • +1 and I've edited my response to note the lack of a conventional `method` – mnel Mar 27 '13 at 04:08
  • I was curious about whether diag(var(df)) or sapply(df, var) would be faster. In this example: `df <- data.frame(x=rnorm(4*10^7), y=rbinom(4*10^7, 100, .5))`, `sapply(df,var)` takes about 1.8 seconds and increases R's RAM usage from about 1GB to 1.7GB during the operation, while `sapply(diag(var(df)))` takes about 4.5 seconds, and increases RAM from 1GB to about 2.5GB during the operation (R 2.15.0, MacBook Air 1.6GHz Intel Core 2 Duo, 4GB RAM, OS X 10.6.8). – Mars Mar 27 '13 at 16:15
  • @Mars No, you have that wrong. The entire call is `diag(var(df))`; you don't `sapply` it. As you show in your question `var(df)` returns the entire variance-covariance matrix of `df`. The bits you want are on the diagonal so we extract them with `diag()`. Anyway, I realise now that this way also computes the covariances so is likely slower than the `sapply` version. – Gavin Simpson Mar 27 '13 at 16:29
  • Sorry, that was a typo. I did run `diag(var(df))` without `sapply` (which wouldn't work, I believe). – Mars Mar 27 '13 at 20:12
1

Your actual answer has been covered by @GavinSimpson. For var you could also just use:

sd(df)^2
#   x     y 
# 2.5 250.0 

By doing so you will see what @GavinSimpson means about R "moving away from having functions like mean and sd operate on the components (columns) of a data frame". Deprecated means the functionality may be retired in an upcoming version of R, and your code may break if you don't heed the warning and change it accordingly:

Warning message: sd() is deprecated. Use sapply(*, sd) instead.

So we could use:

sapply(df,sd)^2
#   x     y 
# 2.5 250.0 

Which gives us the exact same result.

However, it's kinda silly to do it this way, as you are effectively calling (sqrt(var(x, na.rm = na.rm)))^2 on each column! Instead, as @mnel suggests, sapply(df, var) is how you should obtain the variance of each column vector.
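
A short sketch confirming that the two routes agree on the question's data. all.equal() is used rather than identical() because squaring a square root can introduce tiny floating-point differences.

```r
df <- data.frame(x = 1:5, y = seq(1, 50, 10))

via_sd  <- sapply(df, sd)^2  # square root taken, then squared again
via_var <- sapply(df, var)   # direct route

# Equal up to floating-point tolerance
print(all.equal(via_sd, via_var))
```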

Simon O'Hanlon