
I have a data frame that looks like this:

V1   V2
..   1
..   2
..   1
..   3

etc.

For each distinct V2 value I would like to calculate the variance of the data in V1. I have just started my adventure with R; any hints on how to do this? For my specific case I guess I can manually do something like

 var1 = var(data[data$V2==1, "V1"])
 var2 = ...

etc., because I know all the possible V2 values (there are not many). However, I am curious what a more generic solution would be. Any ideas?
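The per-value lookup above can be made generic without listing the V2 values by looping over unique(data$V2); a base-R sketch (with made-up sample data, since the real V1 values are not shown):

```r
# Sample data shaped like the question: numeric V1, a small set of V2 values
data <- data.frame(V1 = rnorm(9), V2 = c(1, 2, 1, 3, 2, 1, 3, 2, 1))

# One variance per distinct V2 value, named by that value
vars <- sapply(unique(data$V2), function(v) var(data[data$V2 == v, "V1"]))
names(vars) <- unique(data$V2)
vars
```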

Matt Dowle
mkk

5 Answers


And the old standby, tapply:

dat <- data.frame(x = runif(50), y = rep(letters[1:5],each = 10))
tapply(dat$x, dat$y, FUN = var)

         a          b          c          d          e 
0.03907351 0.10197081 0.08036828 0.03075195 0.08289562 
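The same split-then-apply idea can be spelled out with split plus sapply, which produces the identical named vector (a sketch reusing the dat defined above):

```r
# Split x into one vector per level of y, then apply var to each piece
dat <- data.frame(x = runif(50), y = rep(letters[1:5], each = 10))
sapply(split(dat$x, dat$y), var)
```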
719016
joran
  • I second that... the `*apply` family of functions is very powerful and underrepresented in the accepted answers on Stack Overflow. – adamleerich Aug 25 '11 at 07:48

Another solution uses data.table (applied here to the question's columns, V1 and V2). It is a lot faster, which is especially useful when you have large data sets.

require(data.table)
dat2 <- data.table(dat)
ans  <- dat2[, list(variance = var(V1)), by = 'V2']
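Applied to data shaped like the question's (columns V1 and V2), a self-contained sketch would be:

```r
require(data.table)

# Sample data: numeric V1 grouped by V2 = 1..5
dat <- data.frame(V1 = rnorm(50), V2 = rep(1:5, 10))
dat2 <- data.table(dat)

# One row per distinct V2, with the group variance in a column named 'variance'
ans <- dat2[, list(variance = var(V1)), by = V2]
ans
```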
Ramnath

There are a few ways to do this; I prefer:

dat <- data.frame(V1 = rnorm(50), V2=rep(1:5,10))
dat

# Group V1 by the values in V2, then apply var to each group:
aggregate(V1 ~ V2, data = dat, FUN = var)

> aggregate(V1 ~ V2, data = dat, FUN = var)
  V2        V1
1  1 0.9139360
2  2 1.6222236
3  3 1.2429743
4  4 1.1889356
5  5 0.7000294

Also look into ddply, daply, etc. in the plyr package.

nzcoops
  • thanks, that was very helpful. I will accept this answer in 8 minutes – mkk Aug 25 '11 at 01:16
  • actually I rushed a little bit; when I copy-paste your example I get an error: Error in get(as.character(FUN), mode = "function", envir = envir) : object 'FUN' of mode 'function' was not found – mkk Aug 25 '11 at 01:24
  • the second one. I have the newest version, I guess: 2.13.1 (Windows 7). Maybe it is because some packages are not loaded? Anyway, I managed to make it work via ddply. I copy-pasted wespiserA's code and it worked like a charm without any modification, so I will accept his answer instead. I tried some simple things to fix your method, like adding FUN=var, but it still did not work – mkk Aug 25 '11 at 01:36
  • the error message is in the aggregate call (third line, sorry). Here is the message: > aggregate (V1~V2, data=dat, var) Error in get(as.character(FUN), mode = "function", envir = envir) : object 'FUN' of mode 'function' was not found – mkk Aug 25 '11 at 01:49
  • when you type 'var' and hit enter, does it return a ~12-line function? The only way I can replicate that error is by assigning something to 'var', hence removing the var function... Try typing 'rm(var)' to clear out anything you've stored in var. Please try 'aggregate (V1~V2, data=data, stats::var)' to be sure; this directly calls the var function from base even if it has been masked. Does 'var(data[data$V2==1, "V1"])' from your question actually still work in your session? – nzcoops Aug 25 '11 at 02:05
  • you're 100% right about var. I had tested the manual approach before and assigned its result to var :) After rm(var) everything worked. – mkk Aug 25 '11 at 02:53
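The failure mode discussed above is easy to reproduce: a direct call like var(1:3) still works even while var is masked, because R skips non-function bindings when resolving a call, but passing var as an argument (as aggregate does) picks up the numeric. A sketch:

```r
var <- var(c(1, 2, 3))      # oops: var is now the number 1, not a function
var(c(1, 2, 3))             # still 1 -- a *call* skips the numeric binding
f <- function(FUN) FUN
f(var)                      # 1, the plain number: the argument is the numeric
rm(var)                     # remove the masking object
is.function(var)            # TRUE -- stats::var is visible again
stats::var(c(1, 2, 3))      # an explicit namespace call always works
```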
library(reshape)
ddply(data, .(V2), summarise, variance=var(V1))
wespiserA
  • Isn't ddply in the plyr package? I need to play with ddply a bit more. There's something just not intuitive about the .variable naming convention and the use of summarise seems so arbitrary. – nzcoops Aug 25 '11 at 01:23
  • It is. plyr is a required package for reshape. I use functions from both, so I usually just load reshape – wespiserA Aug 25 '11 at 01:27
  • @nzcoops The `.fun`, `.variable` naming convention is done to mitigate the sort of object name conflicts that arose in the comments to your answer! ;) The idea is that people will be very unlikely to name their own variables/functions `.foo`. – joran Aug 25 '11 at 02:52
  • Heh yup. I guess it's just how far down the worm hole you want to go. I find things like aggregate clearer and more logical; like I say, with summarise and transform in the plyr functions, they're not intuitive, or mentioned in the help. Similarly with the data.table solution, I would never have thought you could have two ',' commas separating things inside the [] of a table; it's counterintuitive to what you learn starting out in R (and easy to miss in the help, given its layout compared to base functions). These things risk becoming expert functions rather than general-use ones, unfortunately, IMO of course. – nzcoops Aug 25 '11 at 03:08
  • @nzcoops All of the unintuitive things you mention are base R functions and have nothing to do with anything that `plyr` adds. See `?"[.data.frame"` for how to use the third comma. See `?transform` for help about this base R function. – Andrie Aug 25 '11 at 08:11
  • @Andrie summarise is part of plyr. By unintuitive I don't mean I think they're wrong; I just think people coming out of intro (or follow-up) R courses may well miss these things. As mentioned in a comment below, people are quick to recommend and accept plyr answers like this; my experience is that ddply's performance is far inferior to base's aggregate (and others). This afternoon aggregate took 38s to generate the output for 1.3mil rows where ddply took 52s to generate the output (same function) on the first 10k rows. Everything has its place though. – nzcoops Aug 25 '11 at 08:59

Using dplyr you can do:

library(dplyr)
data %>%
  group_by(V2) %>%
  summarize(var = var(V1))

Here we group by the unique values of V2 and compute the variance of V1 within each group.
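A self-contained sketch with sample data shaped like the question's:

```r
library(dplyr)

# Sample data: numeric V1 grouped by V2 = 1..5
data <- data.frame(V1 = rnorm(50), V2 = rep(1:5, 10))

data %>%
  group_by(V2) %>%
  summarize(var = var(V1))   # a 5 x 2 result: V2 and the within-group variance
```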

MrFlick