
How can I calculate, in R, the overall variance and the variance for each group from a dataset that looks like this (for example):

Group Count Value
A      3     5
A      2     8
B      1     11
B      3     15

I know that to calculate the variance as a whole, ignoring the groups, I would use var(rep(x$Value, x$Count)). But how do I automatically calculate the variance for each group, accounting for the frequency (e.g., the variance for group A, group B, etc.)? I would like my output to have the following headers:

Group, Total Count, Group Variance 
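For reference, the example data and the overall-variance call mentioned above can be reproduced with a short sketch like the following. The data frame name x is taken from the call in the question; the variable name overall_var is only an illustrative choice.

```r
# Example data from the question: each row is a value and how often it occurs
x <- data.frame(
  Group = c("A", "A", "B", "B"),
  Count = c(3, 2, 1, 3),
  Value = c(5, 8, 11, 15)
)

# Expand each Value by its Count, then take the sample variance of all 9 values
overall_var <- var(rep(x$Value, x$Count))
overall_var
# 19.75
```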

I have also reviewed this link: "R computing mean, median, variance from file with frequency distribution". That question is different (it does not have the group component), so this is not a duplicate.

Thank you for all of the help.

blast00

2 Answers


One option is to use data.table. Convert the data.frame to a data.table (setDT), then compute the variance of "Value" (with each value repeated "Count" times) and the sum of "Count" by "Group".

library(data.table)
setDT(df1)[, list(GroupVariance=var(rep(Value, Count)),
                      TotalCount=sum(Count)) , by = Group]
#    Group GroupVariance TotalCount
#1:     A           2.7          5
#2:     B           4.0          4

A similar approach using dplyr is

library(dplyr)
group_by(df1, Group) %>% 
      summarise(GroupVariance=var(rep(Value,Count)), TotalCount=sum(Count))
#     Group GroupVariance TotalCount
#1     A           2.7          5
#2     B           4.0          4
akrun

Here's a quick approach with base R. The first step is to expand your data set by repeating each row Count times (df is the original data, df1 the expanded copy); then calculate the variance by group.

df1 <- df[rep(seq_len(nrow(df)), df$Count), ]
with(df1, tapply(Value, Group, var))
#   A   B 
# 2.7 4.0 

Or similarly, with aggregate, which also returns the total count per group

aggregate(Value ~ Group, df1, function(x) c(Var = var(x), Count = length(x)))
#   Group Value.Var Value.Count
# 1     A       2.7         5.0
# 2     B       4.0         4.0
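If the expanded copy is not needed afterwards, the same summary can be sketched directly on the original data, with column names matching the headers requested in the question. This is only one possible base R arrangement; the name res is an illustrative choice, and df is assumed to hold the original four-row data.

```r
# Original (unexpanded) data, as in the question
df <- data.frame(
  Group = c("A", "A", "B", "B"),
  Count = c(3, 2, 1, 3),
  Value = c(5, 8, 11, 15)
)

# Split by group, summarise each piece, and bind the pieces back together
res <- do.call(rbind, lapply(split(df, df$Group), function(g) {
  data.frame(
    Group         = g$Group[1],
    TotalCount    = sum(g$Count),
    GroupVariance = var(rep(g$Value, g$Count))  # expand within the group only
  )
}))
rownames(res) <- NULL
res
```

For the example data this gives a total count of 5 and a variance of 2.7 for group A, and 4 and 4.0 for group B, matching the results above.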
David Arenburg
  • I am trying to figure it out - but the aggregate option isn't working for me. Not an error, just the wrong values... your first option and the "data table" one, works. – blast00 Feb 22 '15 at 16:27
  • Are you using `df` or `df1` in the `aggregate` call? It's important to use `df1` – David Arenburg Feb 22 '15 at 16:27
  • Ah, I see. Thanks. Works now. So the aggregate is an alternative to the "with" option here. Thanks for the help. So many different ways to do the same thing; trying to figure out which one is the most intuitive. – blast00 Feb 22 '15 at 16:30
  • @DavidArenburg, is it possible to compute the variance with base R without the rep() function? I have probabilities but no frequency table. – Nick Mar 01 '20 at 02:56
  • @Nick The `rep` function has nothing to do with `var`. It was used to extend the data size. – David Arenburg Mar 01 '20 at 07:17
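
Regarding the question of avoiding rep(): the variance can also be computed directly from the values and their weights. The sketch below is one way to do that in base R; freq_var and prob_var are hypothetical helper names, not functions from any package. With frequency counts, dividing by the total count minus one reproduces var(rep(...)). With probabilities there is no sample size, so the second helper assumes the probabilities sum to 1 and returns the population variance (no n - 1 correction).

```r
# Sample variance from values x and frequency counts w, without expanding rows
freq_var <- function(x, w) {
  mu <- sum(w * x) / sum(w)            # frequency-weighted mean
  sum(w * (x - mu)^2) / (sum(w) - 1)   # same denominator as var(rep(x, w))
}

# Population variance from values x and probabilities p (assumed to sum to 1)
prob_var <- function(x, p) {
  mu <- sum(p * x)
  sum(p * (x - mu)^2)
}

df <- data.frame(
  Group = c("A", "A", "B", "B"),
  Count = c(3, 2, 1, 3),
  Value = c(5, 8, 11, 15)
)

# Per-group variance computed from the counts
by_group <- sapply(split(df, df$Group), function(g) freq_var(g$Value, g$Count))
by_group
```

For group A, freq_var(c(5, 8), c(3, 2)) agrees with var(rep(c(5, 8), c(3, 2))), while prob_var(c(5, 8), c(0.6, 0.4)) is smaller because it divides by the full weight rather than n - 1.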