0

I am trying to get the standard deviation for one column in a data frame, grouped by several other columns.

x <- c("Paul", "Paul", "Paul", "Jennifer", "Jennifer", "Jennifer")
y <- c("a", "a", "b", "c", "c", "d")
g <- c("eins", "eins", "zwei", "drei", "drei", "vier")
z <- c(1,2,3,4,5,6)
df <- tibble(Fall = x, DRG = y, DRG2 = g, Anzahl = z)

df$Fall <- as.factor(df$Fall)
df$DRG <- as.factor(df$DRG)
df$DRG2 <- as.factor(df$DRG2)



This is the tibble:

df
# A tibble: 6 x 4
  Fall     DRG   DRG2  Anzahl
  <fct>    <fct> <fct>  <dbl>
1 Paul     a     eins       1
2 Paul     a     eins       2
3 Paul     b     zwei       3
4 Jennifer c     drei       4
5 Jennifer c     drei       5
6 Jennifer d     vier       6

Calculating the mean works:

aggregate(x = df, 
          by = list(df$Fall, df$DRG, df$DRG2),
          FUN = mean, na.rm = TRUE)

   Group.1 Group.2 Group.3 Fall DRG DRG2 Anzahl
1 Jennifer       c    drei   NA  NA   NA    4.5
2     Paul       a    eins   NA  NA   NA    1.5
3 Jennifer       d    vier   NA  NA   NA    6.0
4     Paul       b    zwei   NA  NA   NA    3.0

Standard deviation gives me an error:

aggregate(x = df, 
          by = list(df$Fall, df$DRG, df$DRG2),
          FUN = sd, na.rm = TRUE)

Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : 
  Calling var(x) on a factor x is defunct.
  Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.

Why is that? I tried to understand the error message but i don't understand why it works with mean but not with standard deviation. If i turn all the factors to characters, then standard deviation works and gives me correct result. Why is that?

Regards

Dutschke
  • 277
  • 2
  • 15

2 Answers2

1

You are passing the entire dataframe as x argument. Instead you should pass variables that you want to aggregate. There are two ways you can use the aggregate function.

  1. Using values -
aggregate(x = df$Anzahl, 
          by = list(df$Fall, df$DRG, df$DRG2),
          FUN = mean, na.rm = TRUE)
  1. Using formula syntax :
aggregate(Anzahl~ Fall + DRG + DRG2, df, FUN = mean, na.rm = TRUE)

The same would work with sd function as well.


In your attempt mean/sd of all the values will be calculated. The output of mean and sd is different when passed factor variables.

mean(df$Fall)
#[1] NA

but returns a warning

Warning message: In mean.default(df$Fall) : argument is not numeric or logical: returning NA

whereas sd returns an error.

sd(df$Fall)

Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : Calling var(x) on a factor x is defunct. Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.

Hence, mean seems to work whereas sd returns an error.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • That's true! But why does R give me an Error with sd, but not with mean in this case? I am trying to figure out the difference between these two. – Dutschke Oct 12 '20 at 09:48
  • Hope my update in the answer helps to understand the difference between the two. – Ronak Shah Oct 12 '20 at 09:53
  • Is this just random or is there a concept behind it? – Dutschke Oct 12 '20 at 09:55
  • I don't know what you are asking. The behaviour seems completely logical to me. `mean` on factor variables return `NA` with warning whereas it calculates actual mean for numeric variables. `sd` on the other hand returns an error when ran on factor variables. warning does not stop execution of the function whereas error stops the execution. Which part of this you think is random? – Ronak Shah Oct 12 '20 at 09:58
  • To me it's random that the function mean returns a warning and the function sd returns an error. – Dutschke Oct 12 '20 at 10:01
  • Ok..I agree with you on that. For explanation on "why" `mean` and `sd` behave differently I think we need to ask the person who wrote this function for his/her rationale behind it. I am sure there must be some reason/discussion about it. – Ronak Shah Oct 12 '20 at 10:07
0

We can use dplyr

library(dplyr)
df %>%
    group_by(Fall, DRG, DRG2) %>%
    summarise(Anzahl = mean(Anzahl, na.rm = TRUE))
akrun
  • 874,273
  • 37
  • 540
  • 662