1

My question is related to Extract the maximum value within each group in a dataframe.

  • The question there is essentially: how do you select the max value in one column based on repeated groups in a separate column in the same data frame?

  • In that post, user EDi provides a ton of examples for how to accomplish this task.

My question: how do I accomplish the same task, but instead of reporting the max value, I instead report a value in a third column associated with that max value?

For example:

  • Assume I have a data.frame:

    Group  Value   Year
    A      12      1933
    A      10      2010
    B      3       1935
    B      5       1978
    B      6       2011
    C      1       1954
    D      3       1933
    D      4       1978
    
  • For each level of my grouping variable, I wish to extract the year that the maximum value occurred. The result should thus be a data frame with one row per level of the grouping variable:

    Group  Year
    A      1933
    B      2011
    C      1954
    D      1978
    

I know I could use any of the answers from EDi's post mentioned above and then just use something like which,match or sapply to figure out the year, but that seems too sloppy.

Is there a quick way to extract a value in column A given the maximum value in column B within each group (column C) in a dataframe?

Update: Could someone please provide a base R solution?

Community
  • 1
  • 1
theforestecologist
  • 4,667
  • 5
  • 54
  • 91

2 Answers2

2
library(dplyr)
df %>% group_by(Group) %>% slice(which.max(Value)) %>% select(-Value)

#Source: local data frame [4 x 2]
#Groups: Group [4]

#   Group  Year
#  <fctr> <int>
#1      A  1933
#2      B  2011
#3      C  1954
#4      D  1978

Note this only keeps one max value per group if ties exist.


A method that keeps tied max values:

library(dplyr)
df %>% group_by(Group) %>% filter(Value == max(Value)) %>% select(-Value)

#Source: local data frame [4 x 2]
#Groups: Group [4]

#   Group  Year
#  <fctr> <int>
#1      A  1933
#2      B  2011
#3      C  1954
#4      D  1978
Psidom
  • 209,562
  • 33
  • 339
  • 356
1

Here is a base R and a data.table solution:

df <- structure(list(Group = c("A", "A", "B", "B", "B", "C", "D", "D"
), Value = c(12L, 10L, 3L, 5L, 6L, 1L, 3L, 4L), Year = c(1933L, 
2010L, 1935L, 1978L, 2011L, 1954L, 1933L, 1978L)), .Names = c("Group", 
"Value", "Year"), row.names = c(NA, -8L), class = "data.frame")

# Base R - use aggregate to get max Value per group, then merge with df
merge(df, aggregate(Value ~ Group, df, max), by.all = c("Group", "Value"))[
  , c("Group", "Year")]

#   Group Year
# 1     A 1933
# 2     B 2011
# 3     C 1954
# 4     D 1978

# data.table solution
library(data.table)
dt <- data.table(df)
dt[, .SD[which.max(Value), .(Year)], by = Group]

#    Group Year
# 1:     A 1933
# 2:     B 2011
# 3:     C 1954
# 4:     D 1978