10

I know this must be super easy, but I'm having trouble finding the right dplyr commands to do this. Let's say I want to group a dataset by two variables, and then summarize the count for each row. For this we simply have:

mtcars %>% group_by(cyl, mpg) %>% summarize(Count = n())

This will generate a dataframe with 27 rows for the three variables cyl, mpg, and Count. What I'd like to do next is summarize the average mpg for each of the three cyl values. Keep in mind that each row may contain a Count greater than one which must be considered when calculating the average. My data frame should have 3 rows of 2 variables cyl, and Avg_mpg. Can someone give me the short code chuck that will do this? Thank you in advance.

smci
  • 32,567
  • 20
  • 113
  • 146
ds_guy
  • 143
  • 2
  • 5

2 Answers2

8

If I have understood you correctly, you need weighted.mean

library(dplyr)
mtcars %>% 
   group_by(cyl, mpg) %>% 
   summarize(Count = n()) %>%
   group_by(cyl) %>%
   summarise(avg_mpg = weighted.mean(mpg, Count))

# A tibble: 3 x 2
#    cyl   avg_mpg
#  <dbl>   <dbl>
#1  4.00    26.7
#2  6.00    19.7
#3  8.00    15.1

which is equivalent to

mtcars %>% 
  group_by(cyl, mpg) %>% 
  summarize(Count = n()) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = sum(mpg * Count)/sum(Count))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • 1
    That's exactly what I needed. Thank you. – ds_guy Apr 24 '18 at 01:25
  • I think you need to use `mutate(Count = n())`, not `summarize(Count = n())`, so as the weighting column `Count` is added to all rows before calculating the weighted mean. Otherwise the result is exactly the same as the mean grouped by `cyl` only. – neilfws Apr 24 '18 at 01:57
  • @neilfws I think I agree with you however, the first part of code is provided by OP and I am not sure how are they using it in their real data. So I would leave it to OP to decide how to use it. Thanks though :) – Ronak Shah Apr 24 '18 at 02:09
1

You are effectively performing a simple mean because the weights are the grouping variable:

library(dplyr)
options(pillar.sigfig=10) # To check they are identical results
    
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))

Output:

The result is identical to the ones proposed above:

# A tibble: 3 x 2
    cyl     avg_mpg
  <dbl>       <dbl>
1     4 26.66363636
2     6 19.74285714
3     8 15.1  

If you need a weighted mean based on another variable:

mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = weighted.mean(mpg, disp))

# A tibble: 3 x 2
    cyl     avg_mpg
  <dbl>       <dbl>
1     4 25.81985300
2     6 19.77197631
3     8 14.86285148
mpalanco
  • 12,960
  • 2
  • 59
  • 67