0

I have a dataframe with more than 190,000 rows, something a bit like this:

 library(tibble)
 mydf <- tribble(~col1, ~col2, ~col3, ~col4, ~col5,
            "A", 16, 45, 53, 35, 
            "A", 17,  12, 54, 12,
            "A", 19, 12, 54, 35,
            "B", 10, 87, 55, 22,
            "B", 10, 87, 55, 22,
            "B", 12, 23, 12, 67)

There are repeated iterations of col1; some have the same values across the columns, others have different values across columns, as shown in the sample dataframe.

For each repeated level in col1, I want to aggregate these values into one row showing the mean of all the rows. I got so far using this answer, however, this leaves me with all the distinct rows:

 mydf %>% group_by(col1) %>% 
   mutate_each(funs(mean), -(1)) %>% 
   distinct()

 # A tibble: 5 x 5
 # Groups:   col1 [2]
   col1   col2  col3  col4  col5
   <chr> <dbl> <dbl> <dbl> <dbl>
 1 A        16  23    53.7  27.3
 2 A        17  23    53.7  27.3
 3 A        19  23    53.7  27.3
 4 B        10  65.7  40.7  37  
 5 B        12  65.7  40.7  37  

What I actually want here is one row for A, B, etc, showing the mean values.

Catherine Laing
  • 475
  • 6
  • 18

2 Answers2

1

You need to use summarize instead of mutate, to summarize the grouped values. In this case I'm using summarize_all to summarize all non-grouped values.

library(tidyverse)

mydf <- tribble(~col1, ~col2, ~col3, ~col4, ~col5,
                "A", 16, 45, 53, 35, 
                "A", 17,  12, 54, 12,
                "A", 19, 12, 54, 35,
                "B", 10, 87, 55, 22,
                "B", 10, 87, 55, 22,
                "B", 12, 23, 12, 67)

mydf %>% 
  group_by(col1) %>% 
  summarize_all(.funs = list(mean))

# A tibble: 2 x 5
  col1   col2  col3  col4  col5
  <chr> <dbl> <dbl> <dbl> <dbl>
1 A      17.3  23    53.7  27.3
2 B      10.7  65.7  40.7  37  
Jake Kaupp
  • 7,892
  • 2
  • 26
  • 36
0

If cols 2 through 5 are all observations of the same measurement, it makes sense to use tidyr::pivot_longer(). Then take group by col1 and the new cols 2 through 5 group. Finally, to get in the form you want, use tidyr::pivot_wider()

library(tidyverse)

mydf <- tribble(~col1, ~col2, ~col3, ~col4, ~col5,
                "A", 16, 45, 53, 35,
                "A", 17,  12, 54, 12,
                "A", 19, 12, 54, 35,
                "B", 10, 87, 55, 22,
                "B", 10, 87, 55, 22,
                "B", 12, 23, 12, 67)

mydf %>%
  pivot_longer(cols = -col1) %>%
  group_by(col1, name) %>%
  summarise(mean = mean(value)) %>%
  pivot_wider(names_from = name, values_from = mean)
#> # A tibble: 2 x 5
#> # Groups:   col1 [2]
#>   col1   col2  col3  col4  col5
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A      17.3  23    53.7  27.3
#> 2 B      10.7  65.7  40.7  37

Created on 2020-03-20 by the reprex package (v0.3.0)

11rchitwood
  • 25
  • 1
  • 6
  • This gives me a little more than what I need, as I want to keep the values column-wise, and only summarise row-wise. But useful to know for future, thank you! – Catherine Laing Mar 20 '20 at 15:36
  • 1
    Glad to help. In my experience, pivoting back and forth is a useful pattern for many data analyses. Also, thank you so much for using `tibble::tribble()`! – 11rchitwood Mar 20 '20 at 15:45