How can I extract the "best" row from each sub-category in my data frame?

Question

In the example below, the indices returned by the order function are used to sort the entries in each group by a :

set.seed(123)

ex.df <- data.frame(
  group = sample(LETTERS[1:4],20,replace=TRUE),
  score1 = sample(1:10),
  score2 = sample(1:10)
)

sortedOrderings <- by(ex.df, ex.df$group, function(df) order(df$score1 + df$score2) )

bestIndices <- lapply(sortedOrderings, FUN= function(lst) lst[1] )

The trouble is that order sees the indices of the data frame subsetted by by rather than ex.df itself, so using it to extract the relevant rows from the ex.df isn't the brightest idea:

print(sortedOrderings)

ex.df$group: A
[1] 2 3 4 1
--------------------------------------------------------------- 
ex.df$group: B
[1] 5 3 2 4 1
--------------------------------------------------------------- 
ex.df$group: C
[1] 2 1 3 4
--------------------------------------------------------------- 
ex.df$group: D
[1] 3 7 4 6 1 2 5

> print(ex.df[bestIndices,])
    group score1 score2
2       D      7      9
5       D      4      1
2.1     D      7      9
3       B      6      6

Is there a way to pull out the "best" row from each group in ex.df, or at least have the indices reference ex.df?

you could use `ave` to do the *split- apply-combine* for you : `with(ex.df, ex.df[ (score1 + score2) == ave(score1 + score2, group, FUN=max), ])` (although I'd probably do the sum on a previous line - and maybe use rowSums) — user20650, Oct 04 '17 at 19:00
Actually from https://stackoverflow.com/questions/24558328/how-to-select-the-row-with-the-maximum-value-in-each-group ; you will probs find a few ways to do this : *data.table*: `library(data.table) ; dt = as.data.table(ex.df) dt[dt[, .I[(score1 + score2) == max(score1 + score2)], by=group]$V1]` , *dplyr* : `library(dplyr) ; ex.df %>% group_by(group) %>% top_n(1, score1+score2)`. If these are giving the expected outcome, it may be worth marking this as a dup?? — user20650, Oct 04 '17 at 19:12

score 1 · Answer 1 · answered Oct 04 '17 at 18:02

You can use the dplyr package and the rank function. It looks like this:

ex.df %>%
  mutate(total_score = score1 + score2) %>%
  group_by(group) %>%
  mutate(rank = rank(total_score)) %>%
  filter(rank == max(rank)) %>%
  select(-c(rank)) %>%
  arrange(group)

and gives you this:

# A tibble: 4 x 4
# Groups:   group [4]
   group score1 score2 total_score
  <fctr>  <int>  <int>       <int>
1      A      8      3          11
2      B      9     10          19
3      C     10      8          18
4      D      9     10          19

score 1 · Accepted Answer · answered Oct 04 '17 at 19:15

1

Using data.table to perform a self join on the indices of the first row where total score is equal to the max score by group:

set.seed(123)

ex.df <- data.frame(
  group = sample(LETTERS[1:4],20,replace=TRUE),
  score1 = sample(1:10),
  score2 = sample(1:10)
)

library(data.table)
setDT(ex.df)

ex.df[ex.df[,.I[(score1 + score2) == max(score1 + score2)][1],by = .(group)]$V1][order(group)]

Returns

   group score1 score2
1:     A      8      3
2:     B      9     10
3:     C     10      8
4:     D      9     10

answered Oct 04 '17 at 19:15

Matt Summersgill

4,054
18
47

@user20650 beat me to the punch on this one, only difference is that the extra `[1]` approximately mid-line is necessary to ensure that only one row per group is returned in the event of a tie. – Matt Summersgill Oct 04 '17 at 19:32
This works a treat. Thanks a bunch – Zaid Oct 05 '17 at 07:32

How can I extract the "best" row from each sub-category in my data frame?

2 Answers2