dplyr summarise (collapse) dataset by different functions for multiple columns

Question

I'm trying to dplyr::summarise a dataset (collapse) by different summarise_at/summarise_if functions so that I have the same named variables in my output dataset. Example:

library(tidyverse)
data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris$letter <- as.factor(rep(c("A","B","C"),each=50)) ## factor column
head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species year color letter
1          5.1         3.5          1.4         0.2  setosa 2000   red      A
2          4.9         3.0          1.4         0.2  setosa 2000   red      A
3          4.7         3.2          1.3         0.2  setosa 2000   red      A

The resulting dataset should look like this:

full
  Species     year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
  <fct>      <dbl>       <dbl>       <dbl>        <dbl>        <dbl> <fct>  <chr>
1 setosa      2000        87           6.2          5.8          1.9 A      red  
2 setosa      3000        84.4         6.1          5.5          1.9 A      red  
3 versicolor  2000        69.4        33.6          7            4.9 B      green
4 versicolor  3000        69.1        32.7          6.8          5.1 B      green
5 virginica   2000        73.2        51.1          7.7          6.9 C      blue 
6 virginica   3000        75.5        50.2          7.9          6.4 C      blue

I can achieve this by doing the following which is a bit repetitive:

sums <- iris %>% 
  group_by(Species, year) %>% 
  summarise_at(vars(matches("Width")), list(sum))
max <- iris %>% 
  group_by(Species, year) %>% 
  summarise_at(vars(matches("Length")), list(max))
last <- iris %>% 
  group_by(Species, year) %>% 
  summarise_if(is.factor, list(last))
first <- iris %>% 
  group_by(Species, year) %>% 
  summarise_if(is.character, list(first))

full <- full_join(sums, max) %>% full_join(last) %>%  full_join(first)

I have found similar approaches below but can't figure out the approach I've tried here. I would prefer not to make my own function as I think something like this is cleaner by passing everything through a pipe and joining:

test <- iris %>%
  #group_by(.vars = vars(Species, year)) %>% #why doesnt this work?
  group_by_at(.vars = vars(Species, year))  %>% #doesnt work 
    {left_join(
    summarise_at(., vars(matches("Width")), list(sum)),
    summarise_at(., vars(matches("Length")), list(max)),
    summarise_if(., is.factor, list(last)),
    summarise_if(., is.character, list(first))
    )
      } #doesnt work

This doesnt work, any suggestions or other approaches?

Helpful: How can I use summarise_at to apply different functions to different columns? Summarize different Columns with different Functions Using dplyr summarize with different operations for multiple columns

You want the sum of the width columns, max of the Lengths columns, last letter and first color? Than can be done using summarise function e.g `summarise(min(Sepal.Width), max(Sepal.Length))` — Vikrant, Sep 08 '19 at 15:17
Your second approach would only work if you would have 2 summarise statements because full_join can only join 2 data frames — domaeg, Sep 08 '19 at 16:50
`left_join` edited. @Vikrant that wouldnt be flexible enough if I had a large dataset with many variables. Using `summarise_at`/`summarise_if` gets around this. — user63230, Sep 08 '19 at 19:52

Ulises Rosas-Puchuri · Accepted Answer · 2019-09-09T00:52:54.290

By default, the dplyr::left_join() function only accepts two data frames. If you want to use this function with more than two data frames, you can iterate it with the Reduce function (base R function):


iris %>%
  group_by(Species, year) %>%
  {
    Reduce(
      function(x, y) left_join(x, y),
      list(
        summarise_at(., vars(matches("Width")), base::sum),
        summarise_at(., vars(matches("Length")), base::max),
        summarise_if(., is.factor, dplyr::last),
        summarise_if(., is.character, dplyr::first)
      ))
  }

#   Species     year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
#   <fct>      <dbl>       <dbl>       <dbl>        <dbl>        <dbl> <fct>  <chr>
# 1 setosa      2000        87           6.2          5.8          1.9 A      red  
# 2 setosa      3000        84.4         6.1          5.5          1.9 A      red  
# 3 versicolor  2000        69.4        33.6          7            4.9 B      green
# 4 versicolor  3000        69.1        32.7          6.8          5.1 B      green
# 5 virginica   2000        73.2        51.1          7.7          6.9 C      blue 
# 6 virginica   3000        75.5        50.2          7.9          6.4 C      blue

Furthermore, notice I had to call functions from its package by using :: in order to avoid name overlapping with previously created data frames.

thats it, thank you. Would `purrr::reduce` simplify it even more? — user63230, Sep 10 '19 at 08:39
I didn't know purr can also help. Answer down below looks great, @user63230 — Ulises Rosas-Puchuri, Sep 13 '19 at 20:40

score 1 · Answer 2 · answered Sep 11 '19 at 16:33

Robbing @Ulises idea and using purrr::reduce instead of Reduce is an alternative:

iris %>%
  group_by(Species, year) %>%
  list(
    summarise_at(., vars(matches("Width")), base::sum),
    summarise_at(., vars(matches("Length")), base::max),
    summarise_if(., is.factor, dplyr::last),
    summarise_if(., is.character, dplyr::first)
  ) %>%
  .[c(2:5)] %>%
  reduce(left_join)

OR solution with curly brackets to suppress the first argument:

iris %>%
  group_by(Species, year) %>%
  {
  list(
    summarise_at(., vars(matches("Width")), base::sum),
    summarise_at(., vars(matches("Length")), base::max),
    summarise_if(., is.factor, dplyr::last),
    summarise_if(., is.character, dplyr::first)
  )
  } %>%
  reduce(left_join)

this can be quite slow for larger datasets see other workarounds https://stackoverflow.com/questions/65083735/efficient-summarise-of-different-columns-with-different-functions-with-dplyr — user63230, Mar 24 '21 at 12:33

dplyr summarise (collapse) dataset by different functions for multiple columns

2 Answers2