1

I'm trying to dplyr::summarise a dataset (collapse) by different summarise_at/summarise_if functions so that I have the same named variables in my output dataset. Example:

library(tidyverse)
data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris$letter <- as.factor(rep(c("A","B","C"),each=50)) ## factor column
head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species year color letter
1          5.1         3.5          1.4         0.2  setosa 2000   red      A
2          4.9         3.0          1.4         0.2  setosa 2000   red      A
3          4.7         3.2          1.3         0.2  setosa 2000   red      A

The resulting dataset should look like this:

full
  Species     year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
  <fct>      <dbl>       <dbl>       <dbl>        <dbl>        <dbl> <fct>  <chr>
1 setosa      2000        87           6.2          5.8          1.9 A      red  
2 setosa      3000        84.4         6.1          5.5          1.9 A      red  
3 versicolor  2000        69.4        33.6          7            4.9 B      green
4 versicolor  3000        69.1        32.7          6.8          5.1 B      green
5 virginica   2000        73.2        51.1          7.7          6.9 C      blue 
6 virginica   3000        75.5        50.2          7.9          6.4 C      blue 

I can achieve this by doing the following which is a bit repetitive:

sums <- iris %>% 
  group_by(Species, year) %>% 
  summarise_at(vars(matches("Width")), list(sum))
max <- iris %>% 
  group_by(Species, year) %>% 
  summarise_at(vars(matches("Length")), list(max))
last <- iris %>% 
  group_by(Species, year) %>% 
  summarise_if(is.factor, list(last))
first <- iris %>% 
  group_by(Species, year) %>% 
  summarise_if(is.character, list(first))

full <- full_join(sums, max) %>% full_join(last) %>%  full_join(first) 

I have found similar approaches below but can't figure out the approach I've tried here. I would prefer not to make my own function as I think something like this is cleaner by passing everything through a pipe and joining:

test <- iris %>%
  #group_by(.vars = vars(Species, year)) %>% #why doesnt this work?
  group_by_at(.vars = vars(Species, year))  %>% #doesnt work 
    {left_join(
    summarise_at(., vars(matches("Width")), list(sum)),
    summarise_at(., vars(matches("Length")), list(max)),
    summarise_if(., is.factor, list(last)),
    summarise_if(., is.character, list(first))
    )
      } #doesnt work

This doesnt work, any suggestions or other approaches?

Helpful: How can I use summarise_at to apply different functions to different columns? Summarize different Columns with different Functions Using dplyr summarize with different operations for multiple columns

user63230
  • 4,095
  • 21
  • 43
  • You want the sum of the width columns, max of the Lengths columns, last letter and first color? Than can be done using summarise function e.g `summarise(min(Sepal.Width), max(Sepal.Length))` – Vikrant Sep 08 '19 at 15:17
  • Your second approach would only work if you would have 2 summarise statements because full_join can only join 2 data frames – domaeg Sep 08 '19 at 16:50
  • `left_join` edited. @Vikrant that wouldnt be flexible enough if I had a large dataset with many variables. Using `summarise_at`/`summarise_if` gets around this. – user63230 Sep 08 '19 at 19:52

2 Answers2

1

By default, the dplyr::left_join() function only accepts two data frames. If you want to use this function with more than two data frames, you can iterate it with the Reduce function (base R function):


iris %>%
  group_by(Species, year) %>%
  {
    Reduce(
      function(x, y) left_join(x, y),
      list(
        summarise_at(., vars(matches("Width")), base::sum),
        summarise_at(., vars(matches("Length")), base::max),
        summarise_if(., is.factor, dplyr::last),
        summarise_if(., is.character, dplyr::first)
      ))
  }
#   Species     year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
#   <fct>      <dbl>       <dbl>       <dbl>        <dbl>        <dbl> <fct>  <chr>
# 1 setosa      2000        87           6.2          5.8          1.9 A      red  
# 2 setosa      3000        84.4         6.1          5.5          1.9 A      red  
# 3 versicolor  2000        69.4        33.6          7            4.9 B      green
# 4 versicolor  3000        69.1        32.7          6.8          5.1 B      green
# 5 virginica   2000        73.2        51.1          7.7          6.9 C      blue 
# 6 virginica   3000        75.5        50.2          7.9          6.4 C      blue 

Furthermore, notice I had to call functions from its package by using :: in order to avoid name overlapping with previously created data frames.

Ulises Rosas-Puchuri
  • 1,900
  • 10
  • 12
1

Robbing @Ulises idea and using purrr::reduce instead of Reduce is an alternative:

iris %>%
  group_by(Species, year) %>%
  list(
    summarise_at(., vars(matches("Width")), base::sum),
    summarise_at(., vars(matches("Length")), base::max),
    summarise_if(., is.factor, dplyr::last),
    summarise_if(., is.character, dplyr::first)
  ) %>%
  .[c(2:5)] %>%
  reduce(left_join)

OR solution with curly brackets to suppress the first argument:

iris %>%
  group_by(Species, year) %>%
  {
  list(
    summarise_at(., vars(matches("Width")), base::sum),
    summarise_at(., vars(matches("Length")), base::max),
    summarise_if(., is.factor, dplyr::last),
    summarise_if(., is.character, dplyr::first)
  )
  } %>%
  reduce(left_join)
user63230
  • 4,095
  • 21
  • 43
  • this can be quite slow for larger datasets see other workarounds https://stackoverflow.com/questions/65083735/efficient-summarise-of-different-columns-with-different-functions-with-dplyr – user63230 Mar 24 '21 at 12:33