R: How to aggregate some columns while keeping other columns

Question

I have a similar problem as described here, but none of the solutions from there which I have tried work.

Given a table like this:

Date    Exercise    Category    Weight  Reps    EstMax  RepxWeight  Note
4/2/16  Deadlift    Legs    135 7   166.4685    7x135   easy
4/2/16  Deadlift    Legs    135 7   166.4685    7x135   kinda easy
4/2/16  Deadlift    Legs    135 7   166.4685    7x135   tired
4/2/16  Bench Press Chest   95  5   110.8175    5x95    hard
4/2/16  Bench Press Chest   135 2   143.991 2x135   not hard
4/9/16  Bench Press Chest   135 2   143.991 2x135   a little hard
4/9/16  Bench Press Chest   135 2   143.991 2x135   super tired
4/18/16 Deadlift    Legs    155 8   196.292 8x155   …
4/18/16 Deadlift    Legs    155 5   180.8075    5x155   bad day
5/8/16  Deadlift    Legs    185 3   203.4815    3x185   good day
5/8/16  Deadlift    Legs    185 3   203.4815    3x185   felt easy
5/8/16  Bench Press Chest   115 4   130.318 4x115   easy
5/8/16  Bench Press Chest   115 4   130.318 4x115   hard

I want to aggregate to get the rows that have the max value for a certain column (e.g. EstMax) based on multiple other columns (e.g. Date and Exercise), but also keep all the other columns in the row. And in the case of multiple entries with the same max value, take the first entry.

The expected output would look like this:

Date    Exercise    Category    Weight  Reps    EstMax  RepxWeight  Note
4/2/16  Deadlift    Legs    135 7   166.4685    7x135   easy
4/2/16  Bench Press Chest   135 2   143.991 2x135   not hard
4/9/16  Bench Press Chest   135 2   143.991 2x135   a little hard
4/18/16 Deadlift    Legs    155 8   196.292 8x155   …
5/8/16  Deadlift    Legs    185 3   203.4815    3x185   good day
5/8/16  Bench Press Chest   115 4   130.318 4x115   hard

Examples of some method I've tried; in every case, the 'extra columns' end up being used as factors for the aggregation, which is not what I want.

data <- structure(list(Date = structure(c(2L, 2L, 2L, 2L, 2L, 3L, 3L, 
1L, 1L, 4L, 4L, 4L, 4L), .Label = c("4/18/16", "4/2/16", "4/9/16", 
"5/8/16"), class = "factor"), Exercise = structure(c(2L, 2L, 
2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("Bench Press", 
"Deadlift"), class = "factor"), Category = structure(c(2L, 2L, 
2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("Chest", 
"Legs"), class = "factor"), Weight = c(135L, 135L, 135L, 95L, 
135L, 135L, 135L, 155L, 155L, 185L, 185L, 115L, 115L), Reps = c(7L, 
7L, 7L, 5L, 2L, 2L, 2L, 8L, 5L, 3L, 3L, 4L, 4L), EstMax = c(166.4685, 
166.4685, 166.4685, 110.8175, 143.991, 143.991, 143.991, 196.292, 
180.8075, 203.4815, 203.4815, 130.318, 130.318), RepxWeight = structure(c(6L, 
6L, 6L, 5L, 1L, 1L, 1L, 7L, 4L, 2L, 2L, 3L, 3L), .Label = c("2x135", 
"3x185", "4x115", "5x155", "5x95", "7x135", "8x155"), class = "factor"), 
    Note = structure(c(4L, 8L, 11L, 7L, 9L, 2L, 10L, 1L, 3L, 
    6L, 5L, 4L, 7L), .Label = c("…", "a little hard", "bad day", 
    "easy", "felt easy", "good day", "hard", "kinda easy", "not hard", 
    "super tired", "tired"), class = "factor")), .Names = c("Date", 
"Exercise", "Category", "Weight", "Reps", "EstMax", "RepxWeight", 
"Note"), class = "data.frame", row.names = c(NA, -13L))

# base R
aggregate(EstMax ~ Date + Exercise, data = data, FUN = max)
# Date    Exercise   EstMax
# 1  4/2/16 Bench Press 143.9910
# 2  4/9/16 Bench Press 143.9910
# 3  5/8/16 Bench Press 130.3180
# 4 4/18/16    Deadlift 196.2920
# 5  4/2/16    Deadlift 166.4685
# 6  5/8/16    Deadlift 203.4815

aggregate(EstMax ~ Date + Exercise + RepxWeight + Note, data = data, FUN = max)
# Date    Exercise RepxWeight          Note   EstMax
# 1  4/18/16    Deadlift      8x155             … 196.2920
# 2   4/9/16 Bench Press      2x135 a little hard 143.9910
# 3  4/18/16    Deadlift      5x155       bad day 180.8075
# 4   5/8/16 Bench Press      4x115          easy 130.3180
# 5   4/2/16    Deadlift      7x135          easy 166.4685
# 6   5/8/16    Deadlift      3x185     felt easy 203.4815
# 7   5/8/16    Deadlift      3x185      good day 203.4815
# 8   5/8/16 Bench Press      4x115          hard 130.3180
# 9   4/2/16 Bench Press       5x95          hard 110.8175
# 10  4/2/16    Deadlift      7x135    kinda easy 166.4685
# 11  4/2/16 Bench Press      2x135      not hard 143.9910
# 12  4/9/16 Bench Press      2x135   super tired 143.9910
# 13  4/2/16    Deadlift      7x135         tired 166.4685


# data table
library("data.table")
data_dt <- data.table(data)
data_dt[ , max(EstMax), by = c("Date", "Exercise")]
# Date    Exercise       V1
# 1:  4/2/16    Deadlift 166.4685
# 2:  4/2/16 Bench Press 143.9910
# 3:  4/9/16 Bench Press 143.9910
# 4: 4/18/16    Deadlift 196.2920
# 5:  5/8/16    Deadlift 203.4815
# 6:  5/8/16 Bench Press 130.3180

data_dt[, max(EstMax), .(Date, Exercise, Weight, Reps, RepxWeight, Note)]
# Date    Exercise Weight Reps RepxWeight          Note       V1
# 1:  4/2/16    Deadlift    135    7      7x135          easy 166.4685
# 2:  4/2/16    Deadlift    135    7      7x135    kinda easy 166.4685
# 3:  4/2/16    Deadlift    135    7      7x135         tired 166.4685
# 4:  4/2/16 Bench Press     95    5       5x95          hard 110.8175
# 5:  4/2/16 Bench Press    135    2      2x135      not hard 143.9910
# 6:  4/9/16 Bench Press    135    2      2x135 a little hard 143.9910
# 7:  4/9/16 Bench Press    135    2      2x135   super tired 143.9910
# 8: 4/18/16    Deadlift    155    8      8x155             … 196.2920
# 9: 4/18/16    Deadlift    155    5      5x155       bad day 180.8075
# 10:  5/8/16    Deadlift    185    3      3x185      good day 203.4815
# 11:  5/8/16    Deadlift    185    3      3x185     felt easy 203.4815
# 12:  5/8/16 Bench Press    115    4      4x115          easy 130.3180
# 13:  5/8/16 Bench Press    115    4      4x115          hard 130.3180

Especially prefer base R solutions. Also saw the which.max() function which might be helpful but couldn't figure out how to apply it to this.

Other related questions which I looked at but did not solve this:

Adding a non-aggregated column to an aggregated data set based on the aggregation of another column

Only keep min value for each factor level

How to select the row with the maximum value in each group

aggregating multiple columns in data.table

How to aggregate some columns while keeping other columns in R?

aggregating by `Date` and `Exercise` to get the max `EstMax` value for each exercise for each day — user5359531, Nov 21 '17 at 18:23
Its not a duplicate because "Especially prefer base R solutions.". And as I described, the other solutions did not give the results I wanted. — user5359531, May 21 '20 at 19:11

Aurèle · Accepted Answer · 2017-11-21T17:51:40.517

I understand that you seek a base R solution, but in the meanwhile, here is a dplyr one:

library(dplyr)

data %>% 
  group_by(Date, Exercise) %>% 
  slice(which.max(EstMax))

# # A tibble: 6 x 8
# # Groups:   Date, Exercise [6]
#      Date    Exercise Category Weight  Reps   EstMax RepxWeight          Note
#    <fctr>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>        <fctr>
# 1 4/18/16    Deadlift     Legs    155     8 196.2920      8x155             …
# 2  4/2/16 Bench Press    Chest    135     2 143.9910      2x135      not hard
# 3  4/2/16    Deadlift     Legs    135     7 166.4685      7x135          easy
# 4  4/9/16 Bench Press    Chest    135     2 143.9910      2x135 a little hard
# 5  5/8/16 Bench Press    Chest    115     4 130.3180      4x115          easy
# 6  5/8/16    Deadlift     Legs    185     3 203.4815      3x185      good day

Edit

data.table is not my forte, but for the sake of completeness, here's my attempt at it:

library(data.table)

setDT(data)[, .SD[which.max(EstMax)], by = .(Date, Exercise)]

#       Date    Exercise Category Weight Reps   EstMax RepxWeight          Note
# 1:  4/2/16    Deadlift     Legs    135    7 166.4685      7x135          easy
# 2:  4/2/16 Bench Press    Chest    135    2 143.9910      2x135      not hard
# 3:  4/9/16 Bench Press    Chest    135    2 143.9910      2x135 a little hard
# 4: 4/18/16    Deadlift     Legs    155    8 196.2920      8x155             …
# 5:  5/8/16    Deadlift     Legs    185    3 203.4815      3x185      good day
# 6:  5/8/16 Bench Press    Chest    115    4 130.3180      4x115          easy

By far the simplest `dplyr` solution, mimicking the base-R solution, but perhaps a little more readable. — r2evans, Nov 21 '17 at 17:18
The `data.table` method here is working well. However, if I replace `which.max` with `sum(EstMax, na.rm = TRUE)`, I get a table of all `NA`. Any idea what is going on? — user5359531, Nov 23 '17 at 05:56

r2evans · Answer 2 · 2017-11-21T17:41:37.613

One (not-so-correct) method, kept in order to show one problem with summarize all numeric columns independently:

grpvar <- c("Date", "Exercise", "Category")
merge(
  aggregate(data[,c("Weight", "Reps", "EstMax")], by = data[grpvar], FUN = max),
  aggregate(data[,c("RepxWeight", "Note")], by = data[grpvar], FUN = function(a) a[1]),
  by = grpvar
)
#      Date    Exercise Category Weight Reps   EstMax RepxWeight          Note
# 1 4/18/16    Deadlift     Legs    155    8 196.2920      8x155           ...
# 2  4/2/16 Bench Press    Chest    135    5 143.9910       5x95          hard
# 3  4/2/16    Deadlift     Legs    135    7 166.4685      7x135          easy
# 4  4/9/16 Bench Press    Chest    135    2 143.9910      2x135 a little hard
# 5  5/8/16 Bench Press    Chest    115    4 130.3180      4x115          easy
# 6  5/8/16    Deadlift     Legs    185    3 203.4815      3x185      good day

On 4/2/16, your bench press shows a max weight of 135 and max reps of 5, but the two did not occur in the same row in your data.

Here's a slightly (more correct) different approach, using your thought of which.max:

do.call(rbind,
        by(data, data[c("Date", "Exercise")],
           function(x) x[which.max(x$Weight),])
        )
#       Date    Exercise Category Weight Reps   EstMax RepxWeight          Note
# 5   4/2/16 Bench Press    Chest    135    2 143.9910      2x135      not hard
# 6   4/9/16 Bench Press    Chest    135    2 143.9910      2x135 a little hard
# 12  5/8/16 Bench Press    Chest    115    4 130.3180      4x115          easy
# 8  4/18/16    Deadlift     Legs    155    8 196.2920      8x155           ...
# 1   4/2/16    Deadlift     Legs    135    7 166.4685      7x135          easy
# 10  5/8/16    Deadlift     Legs    185    3 203.4815      3x185      good day

If for some reason it is possible to have one Exercise within more than one Category, you might want the second argument of by to be data[c("Date","Exercise","Category")] instead.

(You can order the output with something like x[order(as.Date(x$Date, format="%m/%d/%Y")),] ... in fact you probably meant for the $Date column to be an actual Date-class.)

acylam · Answer 3 · 2017-11-21T18:44:28.720

Here's another method with dplyr:

library(dplyr)
library(lubridate)

data %>%
  mutate(Date = mdy(Date)) %>%
  group_by(Date, Exercise) %>%
  arrange(desc(EstMax)) %>%
  slice(1)

Result:

# A tibble: 6 x 8
# Groups:   Date, Exercise [6]
        Date    Exercise Category Weight  Reps   EstMax RepxWeight          Note
      <date>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>        <fctr>
1 2016-04-02 Bench Press    Chest    135     2 143.9910      2x135      not hard
2 2016-04-02    Deadlift     Legs    135     7 166.4685      7x135          easy
3 2016-04-09 Bench Press    Chest    135     2 143.9910      2x135 a little hard
4 2016-04-18    Deadlift     Legs    155     8 196.2920      8x155             …
5 2016-05-08 Bench Press    Chest    115     4 130.3180      4x115          easy
6 2016-05-08    Deadlift     Legs    185     3 203.4815      3x185      good day

or you can also use sqldf:

library(sqldf)
library(lubridate)

data$Date = mdy(data$Date)

sqldf("select *, max(EstMax) as EstMax2 from data
        group by Date, Exercise
        order by Date, Exercise")

Result:

        Date    Exercise Category Weight Reps   EstMax RepxWeight          Note  EstMax2
1 2016-04-02 Bench Press    Chest    135    2 143.9910      2x135      not hard 143.9910
2 2016-04-02    Deadlift     Legs    135    7 166.4685      7x135          easy 166.4685
3 2016-04-09 Bench Press    Chest    135    2 143.9910      2x135 a little hard 143.9910
4 2016-04-18    Deadlift     Legs    155    8 196.2920      8x155             … 196.2920
5 2016-05-08 Bench Press    Chest    115    4 130.3180      4x115          easy 130.3180
6 2016-05-08    Deadlift     Legs    185    3 203.4815      3x185      good day 203.4815

I like the `sqldf` suggestion, didn't even think of that, thanks — user5359531, Nov 21 '17 at 23:03

atalbot · Answer 4 · 2017-11-21T17:15:32.460

I know you prefer base R solutions, but dplyr offers a function 'top_n' which does exactly what you ask for.

Use it once to retrieve all instances of EstMax:

library(dplyr)

data %>%
  group_by(Exercise) %>%
  top_n(1, EstMax)

# A tibble: 5 x 8
# Groups:   Exercise [2]
    Date    Exercise Category Weight  Reps   EstMax RepxWeight          Note
  <fctr>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>        <fctr>
1 4/2/16 Bench Press    Chest    135     2 143.9910      2x135      not hard
2 4/9/16 Bench Press    Chest    135     2 143.9910      2x135 a little hard
3 4/9/16 Bench Press    Chest    135     2 143.9910      2x135   super tired
4 5/8/16    Deadlift     Legs    185     3 203.4815      3x185      good day
5 5/8/16    Deadlift     Legs    185     3 203.4815      3x185     felt easy

Use it twice to retrieve the first result of the max results:

data %>%
  group_by(Exercise) %>%
  top_n(1, EstMax) %>%
  top_n(1, Date)

Selecting by Note
# A tibble: 2 x 8
# Groups:   Exercise [2]
    Date    Exercise Category Weight  Reps   EstMax RepxWeight        Note
  <fctr>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>      <fctr>
1 4/9/16 Bench Press    Chest    135     2 143.9910      2x135 super tired
2 5/8/16    Deadlift     Legs    185     3 203.4815      3x185    good day

Note that this is taking the first result, not necessarily the earliest date. So you must arrange by date prior to using the second 'top_n':

data %>%
  group_by(Exercise) %>%
  top_n(1, EstMax) %>%
  mutate(Date = as.Date(Date, format = '%d/%m/%y')) %>%
  arrange(Date) %>%
  top_n(1)

Selecting by Note
# A tibble: 2 x 8
# Groups:   Exercise [2]
        Date    Exercise Category Weight  Reps   EstMax RepxWeight        Note
      <date>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>      <fctr>
1 2016-09-04 Bench Press    Chest    135     2 143.9910      2x135 super tired
2 2016-08-05    Deadlift     Legs    185     3 203.4815      3x185    good day

[edit] misread the question slightly, here is a solution that provides the output you are asking for:

data %>%
  group_by(Date, Exercise) %>%
  top_n(1, EstMax) %>%
  top_n(1)

Selecting by Note
# A tibble: 6 x 8
# Groups:   Date, Exercise [6]
     Date    Exercise Category Weight  Reps   EstMax RepxWeight        Note
   <fctr>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>      <fctr>
1  4/2/16    Deadlift     Legs    135     7 166.4685      7x135       tired
2  4/2/16 Bench Press    Chest    135     2 143.9910      2x135    not hard
3  4/9/16 Bench Press    Chest    135     2 143.9910      2x135 super tired
4 4/18/16    Deadlift     Legs    155     8 196.2920      8x155           …
5  5/8/16    Deadlift     Legs    185     3 203.4815      3x185    good day
6  5/8/16 Bench Press    Chest    115     4 130.3180      4x115        hard

This does not give the expected output – acylam Nov 21 '17 at 17:04 — acylam, Nov 21 '17 at 17:04

R: How to aggregate some columns while keeping other columns

4 Answers4

Linked