I am fairly new to R and even newer to dplyr. I have a small data set with two columns, var1 and var2. The var1 column holds numeric values. The var2 column is a factor with three levels: A, B, and C.

        var1 var2
1  1.4395244    A
2  1.7698225    A
3  3.5587083    A
4  2.0705084    A
5  2.1292877    A
6  3.7150650    B
7  2.4609162    B
8  0.7349388    B
9  1.3131471    B
10 1.5543380    B
11 3.2240818    C
12 2.3598138    C
13 2.4007715    C
14 2.1106827    C
15 1.4441589    C

'data.frame':   15 obs. of  2 variables:
 $ var1: num  1.44 1.77 3.56 2.07 2.13 ...
 $ var2: Factor w/ 3 levels "A","B","C": 1 1 1 1 1 2 2 2 2 2 ...

I am trying to use dplyr to group by var2 (A, B, and C), then count the rows and summarize var1 by mean and sd for each group. The count works, but instead of the mean and sd for each group, I get the overall mean and sd repeated next to each group.

To try to resolve the issue, I have conducted multiple internet searches. All results seem to offer a similar syntax to the one I am using. I have also read through all of the recommended posts that Stack Overflow offered prior to posting. Also, I tried restarting R and I made sure that I am not using plyr.

Here is the code that I used to create the data set and the dplyr group_by / summarize.

library(dplyr)
set.seed(123)
var1 <- rnorm(15, mean=2, sd=1)
var2 <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
          "C", "C", "C", "C", "C")
df <- data.frame(var1, var2)
df

df %>%
  group_by(df$var2) %>%
  summarize(
    count = n(),
    mean = mean(df$var1, na.rm = TRUE),
    sd = sd(df$var1, na.rm = TRUE)
  )

Here are the results:

# A tibble: 3 x 4
  `df$var2` count  mean    sd
  <fct>     <int> <dbl> <dbl>
1 A             5  2.15 0.845
2 B             5  2.15 0.845
3 C             5  2.15 0.845

The count appears to work showing a count of 5 for each group. Each group is showing the overall mean and sd for the whole column rather than each group. The expected results are the count, mean, and sd for each group.

I am sure I am overlooking something obvious but I would greatly appreciate any assistance.

Adrian Mole
earlev4
  • Don't use `$` when referring to column names in `dplyr`: `df %>% group_by(var2) %>% summarize( count = n(), mean = mean(var1, na.rm = TRUE), sd = sd(var1, na.rm = TRUE) )` – Ronak Shah Jul 25 '19 at 04:25
  • You want `group_by(var2)`, `mean(var1)` and `sd(var1)`, **not** `mean(df$var1)`, `sd(df$var1)`. The second syntax gives the value for the entire column, not the grouped variable. – neilfws Jul 25 '19 at 04:25
  • Thanks so much!!! Both solutions worked like a charm. I am very grateful for the help. I appreciate it. – earlev4 Jul 25 '19 at 04:31

1 Answer

Even though answered via comments, I felt such a nice reproducible example for a very first question deserved an official answer.

library(dplyr)
set.seed(123)
var1 <- rnorm(15, mean = 2, sd = 1)
var2 <- c(rep("A", 5), rep("B", 5), rep("C", 5))
df <- data.frame(var1, var2)

# Refer to columns by bare name so they are evaluated within each group
df_stat <- df %>%
  group_by(var2) %>%
  summarize(
    count = n(),
    mean = mean(var1, na.rm = TRUE),
    sd = sd(var1, na.rm = TRUE)
  )
head(df_stat)
# A tibble: 3 x 4
# var2   count  mean    sd
# <fct>  <int>  <dbl>  <dbl>
# 1 A      5    2.19   0.811
# 2 B      5    1.96   1.16 
# 3 C      5    2.31   0.639
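
To illustrate why the original attempt returned identical values in every row, here is a minimal sketch that computes the mean both ways side by side. It assumes dplyr is installed; the column names `by_group` and `whole_column` are made up for this illustration. Inside `summarize()`, a bare `var1` refers to the rows of the current group, whereas `df$var1` reaches back to the original, ungrouped data frame and returns all 15 values every time.

```r
library(dplyr)

set.seed(123)
df <- data.frame(
  var1 = rnorm(15, mean = 2, sd = 1),
  var2 = rep(c("A", "B", "C"), each = 5)
)

res <- df %>%
  group_by(var2) %>%
  summarize(
    by_group     = mean(var1),    # bare name: evaluated within each group
    whole_column = mean(df$var1)  # df$var1: full column, ignores the grouping
  )
res
```

The `whole_column` value is the same in every row, while `by_group` differs per level of `var2`. The same reasoning applies to `group_by(df$var2)`: it happens to group correctly here, but it names the grouping column `df$var2` instead of `var2`.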
dbo