2

In R, working in the tidyverse: My data sources change. There's a column which is only present some weeks. When it is, I want to summarize it. Using iris as an example, suppose that Sepal.Width is sometimes missing. Conceptually, I want a function like this

library(tidyverse)

summIris <- function(irisDf){
  irisDf %>% 
    group_by(Species) %>% 
      summarise_ifPresent(
                Sepal.Length = mean(Sepal.Length),
                Sepal.Width = mean(Sepal.Width))
}

Which'd return

R >  summIris(iris  )
# A tibble: 3 x 3
  Species    Sepal.Length Sepal.Width
  <fct>             <dbl>       <dbl>
1 setosa             5.01        3.43
2 versicolor         5.94        2.77
3 virginica          6.59        2.97

 > summIris(iris %>% select(- Sepal.Width ))
# A tibble: 3 x 2
  Species    Sepal.Length 
  <fct>             <dbl> 
1 setosa             5.01 
2 versicolor         5.94  
3 virginica          6.59 

I could work around by wrapping the logic in if else. But is there something more concise and elegant?

David T
  • 1,993
  • 10
  • 18
  • Does this answer your question? [Execute dplyr operation only if column exists](https://stackoverflow.com/questions/45146688/execute-dplyr-operation-only-if-column-exists) – DSH Dec 04 '19 at 19:32
  • Since you're doing a summarize, do you want to do a `summarize_all`? Otherwise it's unclear how you would summarize the other columns, or would you just drop them? – MrFlick Dec 04 '19 at 19:35

1 Answers1

3

summarize_at allows you to define on which columns you execute the summary, and you can use starts_with, ends_with, matches, or contains to dynamically select columns.

library(dplyr)
iris %>%
  group_by(Species) %>%
  summarize_at(vars(starts_with("Sepal")), funs(mean(.)))
# # A tibble: 3 x 3
#   Species    Sepal.Length Sepal.Width
#   <fct>             <dbl>       <dbl>
# 1 setosa             5.01        3.43
# 2 versicolor         5.94        2.77
# 3 virginica          6.59        2.97
iris %>%
  select(-Sepal.Length) %>%
  group_by(Species) %>%
  summarize_at(vars(starts_with("Sepal")), funs(mean(.)))
# # A tibble: 3 x 2
#   Species    Sepal.Width
#   <fct>            <dbl>
# 1 setosa            3.43
# 2 versicolor        2.77
# 3 virginica         2.97

Another one also works but gives a warning with unfound columns:

iris %>%
  select(-Sepal.Length) %>%
  group_by(Species) %>%
  summarize_at(vars(one_of(c("Sepal.Width", "Sepal.Length"))), funs(mean(.)))
# Warning: Unknown columns: `Sepal.Length`
# # A tibble: 3 x 2
#   Species    Sepal.Width
#   <fct>            <dbl>
# 1 setosa            3.43
# 2 versicolor        2.77
# 3 virginica         2.97
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    Or use `matches` to select either one of them `summarize_at(vars(matches("Sepal.*(Width|Length)")), mean)` without warning. The `funs` would be replaced in the recent versions with `list` – akrun Dec 04 '19 at 20:38