0

I am trying to select the maximum value in a dataframe's third column based on the combinations of the values in the first two columns.

My problem is similar to this one but I can't find a way to implement what I need.

EDIT: Sample data changed to make the column names more obvious.

Here is some sample data:

library(tidyr)
set.seed(1234)
df <- data.frame(group1 = letters[1:4], group2 = letters[1:4])
df <- df %>% expand(group1, group2)
df <- subset(df, subset = group1!=group2)
df$score <- runif(n = 12,min = 0,max = 1)
df

    # A tibble: 12 × 3
   group1 group2       score
   <fctr> <fctr>       <dbl>
1       a      b 0.113703411
2       a      c 0.622299405
3       a      d 0.609274733
4       b      a 0.623379442
5       b      c 0.860915384
6       b      d 0.640310605
7       c      a 0.009495756
8       c      b 0.232550506
9       c      d 0.666083758
10      d      a 0.514251141
11      d      b 0.693591292
12      d      c 0.544974836

In this example rows 1 and 4 are 'duplicates'. I would like to select row 4 as the value in the score column is larger than in row 1. Ultimately I would like a dataframe to be returned with the group1 and group2 columns and the maximum value in the score column. So in this example, I expect there to be 6 rows returned.

How can I do this in R?

Community
  • 1
  • 1
meenaparam
  • 1,949
  • 2
  • 17
  • 29
  • 2
    df %>% group_by(a,b) %>% summarise(score=max(score)) – HubertL Mar 24 '17 at 21:02
  • @HubertL I had tried that before posting the question but group_by treats the factors in sequence so that means rows 1 and 4 in the example would still be considered as different groups, when I want them to be considered as the same group. – meenaparam Mar 25 '17 at 12:19

1 Answers1

0

I'd prefer dealing with this problem in two steps:

library(dplyr)

# Create function for computing group IDs from data frame of groups (per column)
get_group_id <- function(groups) {
  apply(groups, 1, function(row) {
    paste0(sort(row), collapse = "_")
  })
}
group_id <- get_group_id(select(df, -score))

# Perform the computation
df %>%
  mutate(groupId = group_id) %>%
  group_by(groupId) %>%
  slice(which.max(score)) %>%
  ungroup() %>%
  select(-groupId)
echasnovski
  • 1,161
  • 8
  • 13
  • Oops, actually this doesn't quite do what I need. The part at the end, with the `separate` function, sometimes reallocates the `group1` value to `group2` and viceversa. See for yourself when running your code on the sample data. The first row returned should show `b` in `group1` and `a` in `group2`, but your suggestion puts them the other way around. Any ideas? – meenaparam Mar 27 '17 at 14:16
  • Updated answer. It wasn't quite clear to me that you want to preserve order of grouping variables. – echasnovski Mar 27 '17 at 17:51
  • Sorry not to be clearer. Thanks very much for the updated answer, this does the trick. I will definitely look into more of the functions offered by `dplyr` as it's been really useful here. – meenaparam Mar 28 '17 at 09:11