0

Let's say I have a dataframe called DT. I want to group based upon the "Country" column, and find the maximum value for columns B, C, D, E, F. I then want to create new columns called B_max, C_max, D_max, E_max, and F_max that have the maximum value for each row.

I can trivially do this in Base R with a long list of commands. I'm wondering if there is a more parsimonious way to do this in dplyr? So, for example:

DT %>% 
  group_by(Country) %>% 
  select(a, b, c, d, e, f)%>%
  mutate(max = max())

That gets me the correct maximum values in a dataframe which is collapsed by country. How would I then un-collapse this data-frame and create all the a_max, b_max, c_max, and so on?

Jaap
  • 81,064
  • 34
  • 182
  • 193
Parseltongue
  • 11,157
  • 30
  • 95
  • 160
  • 3
    If you can do this in base R, why didn't you show it then? It could make it easier for others to understand what you are trying to do. Also, see the guidelines on how to provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). – Jaap Jan 03 '18 at 17:54

1 Answers1

3

Here's a solution using the mtcars dataset:

library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  select(cyl, disp, hp, drat) %>%
  mutate_all(funs(max = max)) %>%
  ungroup()

# # A tibble: 32 x 7
#     cyl  disp    hp  drat disp_max hp_max drat_max
#   <dbl> <dbl> <dbl> <dbl>    <dbl>  <dbl>    <dbl>
# 1  6.00   160 110    3.90      258    175     3.92
# 2  6.00   160 110    3.90      258    175     3.92
# 3  4.00   108  93.0  3.85      147    113     4.93
# 4  6.00   258 110    3.08      258    175     3.92
# 5  8.00   360 175    3.15      472    335     4.22
# 6  6.00   225 105    2.76      258    175     3.92
# 7  8.00   360 245    3.21      472    335     4.22
# 8  4.00   147  62.0  3.69      147    113     4.93
# 9  4.00   141  95.0  3.92      147    113     4.93
# 10 6.00   168 123    3.92      258    175     3.92
# # ... with 22 more rows

Note that (a) you have to include the grouping variable inside select, otherwise the process will force it to be included and show you a warning, (b) you need to use funs() within mutate_all if you want to have the _max prefix and keep the original columns, (c) the grouping variable is not affected by mutate_all.

Otherwise, if you don't use funs() it will look like this:

mtcars %>%
  group_by(cyl) %>%
  select(cyl, disp, hp, drat) %>%
  mutate_all(max) %>%
  ungroup()

# # A tibble: 32 x 4
#     cyl  disp    hp  drat
#   <dbl> <dbl> <dbl> <dbl>
# 1  6.00   258   175  3.92
# 2  6.00   258   175  3.92
# 3  4.00   147   113  4.93
# 4  6.00   258   175  3.92
# 5  8.00   472   335  4.22
# 6  6.00   258   175  3.92
# 7  8.00   472   335  4.22
# 8  4.00   147   113  4.93
# 9  4.00   147   113  4.93
# 10 6.00   258   175  3.92
# # ... with 22 more rows

You can see that in this case the original columns were replaced by the max values as they have the same names.

If you want to do follow the same method, but without select in order to keep all your original columns you can use mutate_at instead like this:

library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  mutate_at(vars(disp,hp,drat), funs(max = max)) %>%
  ungroup()

# # A tibble: 32 x 14
#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb disp_max hp_max drat_max
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>  <dbl>    <dbl>
# 1  21.0  6.00   160 110    3.90  2.62  16.5  0     1.00  4.00  4.00      258    175     3.92
# 2  21.0  6.00   160 110    3.90  2.88  17.0  0     1.00  4.00  4.00      258    175     3.92
# 3  22.8  4.00   108  93.0  3.85  2.32  18.6  1.00  1.00  4.00  1.00      147    113     4.93
# 4  21.4  6.00   258 110    3.08  3.22  19.4  1.00  0     3.00  1.00      258    175     3.92
# 5  18.7  8.00   360 175    3.15  3.44  17.0  0     0     3.00  2.00      472    335     4.22
# 6  18.1  6.00   225 105    2.76  3.46  20.2  1.00  0     3.00  1.00      258    175     3.92
# 7  14.3  8.00   360 245    3.21  3.57  15.8  0     0     3.00  4.00      472    335     4.22
# 8  24.4  4.00   147  62.0  3.69  3.19  20.0  1.00  0     4.00  2.00      147    113     4.93
# 9  22.8  4.00   141  95.0  3.92  3.15  22.9  1.00  0     4.00  2.00      147    113     4.93
# 10 19.2  6.00   168 123    3.92  3.44  18.3  1.00  0     4.00  4.00      258    175     3.92
# # ... with 22 more rows
AntoniosK
  • 15,991
  • 2
  • 19
  • 32
  • Great call on mutate_all and including the naming function. Thanks a bunch! Quick issue -- this will preserve only 7 columns (cyl, disp, hp, drat, and the last three with the _max suffix). How do I "unselect" columns to get the original dataframe columns within the dplyr chain? – Parseltongue Jan 03 '18 at 18:03
  • @Parseltongue Do you mean applying `max` to all columns, or to some columns but also keep the rest of the columns? – AntoniosK Jan 03 '18 at 18:06
  • Sorry, I'm just wondering if there is a way to now take the columns disp_max, hp_max, and drat_max and add them to the original mtcars dataset. I can do this in a really circuitous fashion, but was wondering if there is a way to do this within the dplyr pipeline. – Parseltongue Jan 03 '18 at 18:09
  • mutate_at! Did not know about this. Great work! Thanks so much! – Parseltongue Jan 03 '18 at 18:11
  • 1
    Also, good to know you can pass multiple functions to apply to your columns of choice. You can use `funs(max = max, min = min)` within `mutate_at` – AntoniosK Jan 03 '18 at 18:12