Max value of for each column for distinct rows of data frame in r

Question

Let’s say there is an ordered df with an ID column, and other columns containing numeral data, ordered by the last column.

ID <- c(123, 142, 21, 562, 36, 721, 847, 321)
A <- c(96, 83, 73, 47, 88, 65, 72, 67)
B <- c(72, 69, 88, 75, 63, 89, 48, 80)
C <- c(95, 94, 94, 94, 65, 81, 75, 75)
D <- c(63, 88, 89, 88, 89, 79, 88, 79)
Rating <- c(97, 95, 92, 87, 85, 83, 79, 77)
df <- data.frame(ID, A, B, C, D, Rating)
df
#   ID  A  B  C  D Rating
#1 123 96 72 95 63     97
#2 142 83 69 94 88     95
#3  21 73 88 94 89     92
#4 562 47 75 94 88     87
#5  36 88 63 65 89     85
#6 721 65 89 81 79     83
#7 847 72 48 75 88     79
#8 321 67 80 75 79     77

The aim is to get the max value for each group/column, with its ID, and each pair needs to be from a distinct row (unique ID). For two IDs with the same value for a column, pick the one with the better Rating.

What I did is use the apply() function to get the max from each column, extract the IDs that have that value, and join them all into a data frame. Because I was still missing an ID for the 4th column, I used an anti join to take out the previous IDs and repeated the process to get this data frame:

my_max <- data.frame(apply(df, 2, max))
A2 <- df[which(df$A == my_max[2,1]),]%>% dplyr::select(ID, A)
B2 <- df[which(df$B == my_max[3,1]),]%>% dplyr::select(ID, B)
C2 <- df[which(df$C == my_max[4,1]),]%>% dplyr::select(ID, C)
D2 <- df[which(df$D == my_max[5,1]),]%>% dplyr::select(ID, D)
all <- full_join(A2, B2, by='ID') %>% full_join(C2, by='ID') %>% full_join(D2, by='ID')
all <- all[-c(4),]
df <- anti_join(df, all, by='ID')
my_max <- data.frame(apply(df, 2, max))
C2 <- df[which(df$C == my_max[4,1]),]%>% dplyr::select(ID, C)
all <- all %>% full_join(C2, by='ID')
all <- all[-c(5),-c(4)]

To finally give me:

all
#   ID  A  B  D C.y
#1 123 96 NA NA  NA
#2 721 NA 89 NA  NA
#3  21 NA NA 89  NA
#4 142 NA NA NA  94

Is there a more clean or concise/efficient way of doing this? Not necessarily the same way, perhaps just the ID and role like:

#   ID  Group
#1 123    A
#2 721    B
#3 142    C
#4 21     D

you could simplify getting all the max's into a one liner `apply(df[,2:5],2,function(x) df$ID[which.max(x)]) ` You'll still run into the problem where IDs are repeated though. — Daniel O, Jun 02 '20 at 16:45
I don't understand your problem. You say you want "the max value for each group/column, with its ID, and each pair needs to be from a distinct row (unique ID)". So for column A, the max is 96 with ID 123, but for column C, the max is again ID 123. What determines whether ID 123 should be assigned to A or C? Can you please clarify what you want? That would suggest that you take the max group value by row first, but in that case, ID 23 should be assigned to C since that is the max value for that row..? — doubled, Jun 02 '20 at 16:51
We want to maximize the aggregate sum of their values, so the A+B+C+D will be the greatest value that can be (for different IDs each). So I pick 96 for A for that ID because it is greater than its value for C—95. — Ibrahim B., Jun 02 '20 at 16:58
if the C value is very low for all rows except the first row, would you still rather choose A for the first row instead of C? in other words, are you trying to maximize the sum of values A - D while choosing a set of unique IDs? — chinsoon12, Jun 02 '20 at 22:58

score 4 · Accepted Answer · answered Jun 02 '20 at 17:18

I see that some solutions do not handle duplicated IDs. For instance, we have ID 123 for both groups A and C.

To get an output similar to the final results in your question, another solution that handles duplicated IDs is the following

# initialization
variables <- c("A", "B", "C", "D")
df_max <- data.frame(ID = numeric(length(variables)), Group = variables)

for(column in variables){
  temp_id <- df %>% 
    filter(!(ID %in% df_max$ID)) %>% 
    arrange(desc(!!rlang::sym(column)), desc(Rating)) %>% 
    slice(1) %>% 
    select(ID) %>%
    as.numeric(ID)
  df_max[df_max$Group == column, "ID"] <- temp_id
}

Basically, the filter step makes sure that we do not consider the already selected IDs.

Output

# > df_max
#
#    ID Group
# 1 123     A
# 2 721     B
# 3 142     C
# 4  21     D

Allan Cameron · Answer 2 · 2020-06-02T17:08:35.737

Here's a dplyr solution that will handle duplicate IDs. First, we pivot_longer to get all the letters into a single column. Then we group_by these letters. Finally, within each letter, we sort by the value (and rating for ties in value) and select the first element to get each ID.

library(dplyr)

df %>% 
  pivot_longer(cols = c("A", "B", "C", "D")) %>% 
  group_by(Group = name) %>%
  summarise(ID = ID[order(-value, -Rating)[1]])
#> # A tibble: 4 x 2
#>   Group    ID
#>   <chr> <dbl>
#> 1 A       123
#> 2 B       721
#> 3 C       123
#> 4 D        21

Roland · Answer 3 · 2020-06-02T17:16:02.757

2

Another dplyr/purrr solution, less concise than Allan's.

find_max <- function(gg){
    tibble(
        group=gg, 
        ID= df %>% select(all_of(c(gg,"Rating","ID"))) %>% 
        arrange_all(desc) %>% slice(1) %>% pull(ID))
}

c("A","B","C","D") %>% map_dfr(find_max)

The idea is to use dplyr::arrange to sort the data frame by the group and Rating (descending), and then keep the first line (the maximum)'s ID. Iteration is made using purrr::map_dfr which directly produces a tibble.

The output is:

# A tibble: 4 x 2
  group    ID
  <chr> <dbl>
1 A       123
2 B       721
3 C       123
4 D        21

edited Jun 02 '20 at 17:16

answered Jun 02 '20 at 17:02

Roland

377
4
14

1

I wasn't sure about your output Roland, then I re-read the OP's question and realised I had grouped by the wrong variable. Like your answer too +1 – Allan Cameron Jun 02 '20 at 17:10

score 0 · Answer 4 · answered Jun 02 '20 at 17:04

Based on this answer and using dplyr:

df %>%
  group_by(ID) %>%
  mutate(max.val = pmax(A, B, C, D)[which.max(Rating)]) %>%
  summarise_each(list(max)) %>%
  mutate(top.col=apply(.[,2:5], 1, function(x) names(x)[which.max(x)])) %>%
  select(-c(A, B, C, D, Rating))

you get

# A tibble: 8 x 3
     ID max.val top.col
  <dbl>   <dbl> <chr>  
1    21      94 C      
2    36      89 D      
3   123      96 A      
4   142      94 C      
5   321      80 B      
6   562      94 C      
7   721      89 B      
8   847      88 D

Max value of for each column for distinct rows of data frame in r

4 Answers4