Sorry to ask this ... it's surely a FAQ, and it's kind of a silly question, but it's been bugging me. Suppose I want to get the variance of every numeric column in a dataframe, such as
df <- data.frame(x=1:5,y=seq(1,50,10))
Naturally, I try
var(df)
Instead of giving me what I'd hoped for, which would be something like
  x   y
2.5 250
I get this
     x   y
x  2.5  25
y 25.0 250
which has the variances on the diagonal and the covariances everywhere else. That makes sense once I look up help(var) and read that "var is just another interface to cov". Variance is the covariance of a variable with itself, of course. The output is slightly confusing, but I can read along the diagonal, or get only the variances using diag(var(df)), sapply(df, var), or lapply(df, var), or by calling var repeatedly on df$x and df$y.
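For reference, here is a minimal sketch of those workarounds side by side, using the same df as above. The second data frame, df2, and its letter column g are made up for illustration; they only show one way to restrict the calculation to numeric columns.

```r
# Same data frame as in the question.
df <- data.frame(x = 1:5, y = seq(1, 50, 10))

# var() on a data frame returns the full covariance matrix;
# the variances sit on its diagonal.
cov_mat <- var(df)

# Three ways to get one variance per column.
v_diag   <- diag(cov_mat)     # named numeric vector: x = 2.5, y = 250
v_sapply <- sapply(df, var)   # named numeric vector, same values
v_lapply <- lapply(df, var)   # same values, but as a named list

# Hypothetical data frame with a non-numeric column (g).
# Keep only the numeric columns before taking variances.
df2 <- data.frame(x = 1:5, y = seq(1, 50, 10), g = letters[1:5])
num_cols  <- sapply(df2, is.numeric)
v_numeric <- sapply(df2[num_cols], var)
```

diag(var(df)) computes every covariance and then discards most of them, while sapply(df, var) calls var once per column, so the latter is probably the more direct choice when only the variances are wanted.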
But why? Variance is a routine, basic descriptive statistic, second only to mean. Shouldn't it be completely and totally trivial to apply it to columns of a dataframe? Why give me the covariances when I only asked for variances? Just curious. Thanks for any comments on this.