1

In the example below, the indices returned by the order function are used to sort the entries in each group by a :

set.seed(123)

ex.df <- data.frame(
  group = sample(LETTERS[1:4],20,replace=TRUE),
  score1 = sample(1:10),
  score2 = sample(1:10)
)

sortedOrderings <- by(ex.df, ex.df$group, function(df) order(df$score1 + df$score2) )

bestIndices <- lapply(sortedOrderings, FUN= function(lst) lst[1] )

The trouble is that order sees the indices of the data frame subsetted by by rather than ex.df itself, so using it to extract the relevant rows from the ex.df isn't the brightest idea:

print(sortedOrderings)

ex.df$group: A
[1] 2 3 4 1
--------------------------------------------------------------- 
ex.df$group: B
[1] 5 3 2 4 1
--------------------------------------------------------------- 
ex.df$group: C
[1] 2 1 3 4
--------------------------------------------------------------- 
ex.df$group: D
[1] 3 7 4 6 1 2 5

> print(ex.df[bestIndices,])
    group score1 score2
2       D      7      9
5       D      4      1
2.1     D      7      9
3       B      6      6

Is there a way to pull out the "best" row from each group in ex.df, or at least have the indices reference ex.df?

Zaid
  • 36,680
  • 16
  • 86
  • 155
  • you could use `ave` to do the *split- apply-combine* for you : `with(ex.df, ex.df[ (score1 + score2) == ave(score1 + score2, group, FUN=max), ])` (although I'd probably do the sum on a previous line - and maybe use rowSums) – user20650 Oct 04 '17 at 19:00
  • 1
    Actually from https://stackoverflow.com/questions/24558328/how-to-select-the-row-with-the-maximum-value-in-each-group ; you will probs find a few ways to do this : *data.table*: `library(data.table) ; dt = as.data.table(ex.df) dt[dt[, .I[(score1 + score2) == max(score1 + score2)], by=group]$V1]` , *dplyr* : `library(dplyr) ; ex.df %>% group_by(group) %>% top_n(1, score1+score2)`. If these are giving the expected outcome, it may be worth marking this as a dup?? – user20650 Oct 04 '17 at 19:12
  • 1
    I agree that this can be marked as a dupe – Zaid Oct 05 '17 at 07:31

2 Answers2

1

You can use the dplyr package and the rank function. It looks like this:

ex.df %>%
  mutate(total_score = score1 + score2) %>%
  group_by(group) %>%
  mutate(rank = rank(total_score)) %>%
  filter(rank == max(rank)) %>%
  select(-c(rank)) %>%
  arrange(group)

and gives you this:

# A tibble: 4 x 4
# Groups:   group [4]
   group score1 score2 total_score
  <fctr>  <int>  <int>       <int>
1      A      8      3          11
2      B      9     10          19
3      C     10      8          18
4      D      9     10          19
tbradley
  • 2,210
  • 11
  • 20
1

Using data.table to perform a self join on the indices of the first row where total score is equal to the max score by group:

set.seed(123)

ex.df <- data.frame(
  group = sample(LETTERS[1:4],20,replace=TRUE),
  score1 = sample(1:10),
  score2 = sample(1:10)
)

library(data.table)
setDT(ex.df)

ex.df[ex.df[,.I[(score1 + score2) == max(score1 + score2)][1],by = .(group)]$V1][order(group)]

Returns

   group score1 score2
1:     A      8      3
2:     B      9     10
3:     C     10      8
4:     D      9     10
Matt Summersgill
  • 4,054
  • 18
  • 47
  • @user20650 beat me to the punch on this one, only difference is that the extra `[1]` approximately mid-line is necessary to ensure that only one row per group is returned in the event of a tie. – Matt Summersgill Oct 04 '17 at 19:32
  • This works a treat. Thanks a bunch – Zaid Oct 05 '17 at 07:32