Is there a way to instruct dplyr to use summarise_each with na.rm=TRUE? I would like to take the mean of variables with summarise_each("mean") but I don't know how to specify it to ignore missing values.
5 Answers
Following the links in the doc, it seems you can use funs(mean(., na.rm = TRUE)):
library(dplyr)
by_species <- iris %>% group_by(Species)
by_species %>% summarise_each(funs(mean(., na.rm = TRUE)))

After some years a comment to that: `summarise_each()` is deprecated. In `summarise_all`, you can add `na.rm = TRUE` after the `funs` argument. That is useful when you want to call more than one function, such as: `iris %>% group_by(Species) %>% summarise_all(funs(mean, max, sd), na.rm = TRUE)` – tjebo Jan 09 '18 at 17:51
update
The current dplyr version strongly suggests using across instead of the more specialised functions such as summarise_all.
Translating the syntax below (naming the functions in a named list) into across could look like this:
library(dplyr)
ggplot2::msleep %>%
select(vore, sleep_total, sleep_rem) %>%
group_by(vore) %>%
summarise(across(everything(), .fns = list(mean = mean, max = max, sd = sd), na.rm = TRUE))
#> # A tibble: 5 x 7
#> vore sleep_total_mean sleep_total_max sleep_total_sd sleep_rem_mean
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 carni 10.4 19.4 4.67 2.29
#> 2 herbi 9.51 16.6 4.88 1.37
#> 3 inse~ 14.9 19.9 5.92 3.52
#> 4 omni 10.9 18 2.95 1.96
#> 5 <NA> 10.2 13.7 3.00 1.88
#> # ... with 2 more variables: sleep_rem_max <dbl>, sleep_rem_sd <dbl>
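As of dplyr 1.1.0, passing extra arguments such as na.rm through the `...` of across is deprecated, so on recent versions it may be preferable to put na.rm = TRUE inside a lambda for each function. The following is a minimal sketch of that variant. It assumes dplyr 1.0.0 or later, and it swaps in the built-in airquality data set (not used in the original answer) so that it runs without ggplot2; the name by_month is only illustrative.

```r
library(dplyr)

# Assumption: airquality (shipped with base R) stands in for msleep;
# its Ozone and Solar.R columns contain NAs.
by_month <- airquality %>%
  select(Month, Ozone, Solar.R) %>%
  group_by(Month) %>%
  summarise(across(
    everything(),
    # Each lambda sets na.rm itself instead of relying on across()'s `...`
    list(
      mean = ~ mean(.x, na.rm = TRUE),
      max  = ~ max(.x, na.rm = TRUE),
      sd   = ~ sd(.x, na.rm = TRUE)
    )
  ))

by_month
```

The named list drives the output column names, so the result has columns such as Ozone_mean and Solar.R_sd, one row per month.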
older answer
summarise_each is deprecated now; here is an option with summarise_all.
- One can still specify na.rm = TRUE within the funs argument (cf. @flodel's answer: just replace summarise_each with summarise_all).
- But you can also add na.rm = TRUE after the funs argument.
That is useful when you want to call more than one function, e.g.:
edit
The funs() function is now soft-deprecated (thanks to @Mikko's comment). One can use the suggestions given by the warning; see the code below. na.rm can still be specified as an additional argument within summarise_all.
I used ggplot2::msleep because it contains NAs and shows this better.
library(dplyr)
ggplot2::msleep %>%
select(vore, sleep_total, sleep_rem) %>%
group_by(vore) %>%
summarise_all(funs(mean, max, sd), na.rm = TRUE)
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
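Applying the first suggestion from the warning, a simple named list can replace funs() while na.rm = TRUE is still passed as an additional argument to summarise_all. This is a sketch only: it assumes dplyr 0.8.0 or later, and it uses the built-in airquality data set in place of msleep so that it runs without ggplot2. The name monthly is illustrative.

```r
library(dplyr)

# Assumption: airquality (base R) stands in for msleep; Ozone and Solar.R contain NAs.
monthly <- airquality %>%
  select(Month, Ozone, Solar.R) %>%
  group_by(Month) %>%
  # A named list replaces funs(); na.rm = TRUE is forwarded to every function in the list
  summarise_all(list(mean = mean, max = max, sd = sd), na.rm = TRUE)

monthly
```

Because na.rm is forwarded to every listed function, this shortcut only suits functions that accept an na.rm argument.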

Just for completeness, e.g. using list(...) `summarise_all(list( minimum = ~ min(., na.rm = TRUE), maximum = ~ max(., na.rm = TRUE), s_dev = ~ sd(., na.rm = TRUE)))` – jclouse Aug 21 '19 at 17:57
Take, for instance, the mtcars data set:
library(dplyr)
You can always use summarise to avoid long syntax:
mtcars %>%
group_by(cyl) %>%
summarise(mean_mpg = mean(mpg, na.rm = TRUE),
sd_mpg = sd(mpg, na.rm = TRUE))

I don't know if my answer will add something to the previous comments. Hopefully yes.
In my case, I had a database from an experiment with two groups (control, exp) with different levels for a specific variable (day) and I wanted to get a summary of mean and sd of another variable (weight) for each group for specific levels of the variable day.
Here is an example of my database:
animal  group      day  weight
1.1     "control"  73   NA
1.2     "control"  73   NA
3.1     "control"  73   NA
9.2     "control"  73   25.2
9.3     "control"  73   23.4
9.4     "control"  73   25.8
2.1     "exp"      73   NA
2.2     "exp"      73   NA
10.1    "exp"      73   24.4
10.2    "exp"      73   NA
10.3    "exp"      73   24.6
So, for instance, in this case I wanted to get the mean and sd of the weight on day 73 for each of the groups (control, exp), omitting the NAs.
I did this with this command:
data[data$day=="73",] %>% group_by(group) %>% summarise(mean(weight[group == "exp"], na.rm=T),sd(weight[group == "exp"], na.rm=T))
data[data$day=="73",] %>% group_by(group) %>% summarise(mean(weight[group == "control"], na.rm=T),sd(weight[group == "control"], na.rm=T))
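Since group_by(group) already splits the rows by group, the same result can probably be obtained in a single call without subsetting weight by hand. The sketch below rebuilds the example rows shown above; storing animal as a character ID and the column names mean_weight and sd_weight are assumptions made for illustration.

```r
library(dplyr)

# Example rows from the answer; treating animal as a character ID is an assumption.
data <- data.frame(
  animal = c("1.1", "1.2", "3.1", "9.2", "9.3", "9.4",
             "2.1", "2.2", "10.1", "10.2", "10.3"),
  group  = c(rep("control", 6), rep("exp", 5)),
  day    = 73,
  weight = c(NA, NA, NA, 25.2, 23.4, 25.8, NA, NA, 24.4, NA, 24.6)
)

weight_summary <- data %>%
  filter(day == 73) %>%   # keep only the day of interest
  group_by(group) %>%     # one output row per group
  summarise(
    mean_weight = mean(weight, na.rm = TRUE),
    sd_weight   = sd(weight, na.rm = TRUE)
  )

weight_summary
```

This returns one row per group, so the two separate commands and the group == "exp" / group == "control" subsetting are not needed.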

The summarise_at function in dplyr summarises a dataset at specific columns and allows NAs to be removed for each function applied. Take the iris dataset and compute the mean and median for the variables from Sepal.Length to Petal.Width.
library(dplyr)
summarise_at(iris, vars(Sepal.Length:Petal.Width), funs(mean, median), na.rm = TRUE)
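A column range like this can also be written with across, which avoids the deprecated funs(). The sketch below assumes dplyr 1.0.0 or later. Because iris contains no missing values, it blanks out one value in a copy (the hypothetical iris_na) so that na.rm = TRUE has a visible effect.

```r
library(dplyr)

# Illustrative only: iris has no NAs, so introduce one in a copy.
iris_na <- iris
iris_na$Sepal.Length[1] <- NA

iris_summary <- iris_na %>%
  summarise(across(
    Sepal.Length:Petal.Width,   # same column range as in summarise_at()
    list(
      mean   = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE)
    )
  ))

iris_summary
```

Without na.rm = TRUE in the lambdas, Sepal.Length_mean and Sepal.Length_median would come back as NA for this copy.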
