How to select the highest-valued row by group in R?

Question

I have a dataset of genes in groups called 'loci'. I want to select the gene with the highest score within each locus, comparing it only against the other genes in the same locus/group.

My input data looks like this:

   loci   Gene     Score
1:    1  AQP11 0.5566507
2:    1 CLNS1A 0.2811747
3:    1   RSF1 0.5269924
4:    2  CFDP1 0.4186066
5:    2  CHST6 0.5395135

For locus 1, the output should contain the gene with the highest score out of its 3 genes; for locus 2, it should contain the gene with the higher score of its 2 genes.

So the output from this example I'm trying to get is:

     loci  Gene     Score
1:    1    AQP11   0.5566507 #highest score in loci 1
2:    2    CHST6   0.5395135 #highest score in loci 2

How can I filter for highest score by row groupings? I'm not sure where to start with this.

Input data:

structure(list(loci = c(1L, 1L, 1L, 2L, 2L), Gene = c("AQP11", 
"CLNS1A", "RSF1", "CFDP1", "CHST6"), Score = c(0.556650698184967, 
0.281174659729004, 0.526992380619049, 0.418606609106064, 0.539513528347015
)), row.names = c(NA, -5L), class = c("data.table", "data.frame"
))

I've been trying to do this with dplyr::group_by(), but I keep getting various errors.

If there is a tie, do you want all genes, or randomly pick one? — Michael Dewar, Oct 22 '20 at 13:02
Very good point, I hadn't considered that. It shouldn't happen, but if it does I would want all genes — DN1, Oct 22 '20 at 13:04

score 1 · Answer 1 · answered Oct 22 '20 at 12:59

Using dplyr:

> library(dplyr)
> df %>% group_by(loci) %>% filter(Score == max(Score))
# A tibble: 2 x 3
# Groups:   loci [2]
   loci Gene  Score
  <dbl> <chr> <dbl>
1     1 AQP11 0.557
2     2 CHST6 0.540

Karthik S
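
A possible variation on the dplyr approach above, not part of the original answer: dplyr 1.0.0 and later provide slice_max(), which expresses "top row per group" directly. This is a minimal sketch that assumes the question's data is rebuilt as a plain data frame named df, using the rounded scores shown in the question. with_ties = TRUE keeps every gene that shares the top score, which matches what the asker said they want in the comments.

```r
library(dplyr)

# Question's data rebuilt as a plain data frame (rounded scores from the question)
df <- data.frame(
  loci  = c(1L, 1L, 1L, 2L, 2L),
  Gene  = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.5395135),
  stringsAsFactors = FALSE
)

# Keep the top-scoring row(s) per locus; with_ties = TRUE keeps all tied genes
top_genes <- df %>%
  group_by(loci) %>%
  slice_max(Score, n = 1, with_ties = TRUE) %>%
  ungroup()

print(top_genes)  # genes kept: AQP11 (locus 1), CHST6 (locus 2)
```

Setting with_ties = FALSE instead would return exactly one row per locus even when scores tie.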

score 1 · Answer 2 · answered Oct 22 '20 at 13:07

In data.table:

library(data.table)
setDT(df)
df[, .SD[which.max(Score)], by = loci]

   loci  Gene     Score
1:    1 AQP11 0.5566507
2:    2 CHST6 0.5395135

s_baldur
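
One thing to watch with which.max(): it returns only the first maximum, so a tie within a locus would yield a single gene. Since the asker wants all tied genes, here is a hedged sketch of a tie-preserving data.table variant. The extra row GENE_X is hypothetical and was added only to create a tie in locus 2; it is not in the question's data.

```r
library(data.table)

# Question's data plus one hypothetical gene (GENE_X) that ties CHST6 in locus 2
dt <- data.table(
  loci  = c(1L, 1L, 1L, 2L, 2L, 2L),
  Gene  = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "GENE_X"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.5395135, 0.5395135)
)

# which.max() keeps only the first maximum: one row per locus
first_only <- dt[, .SD[which.max(Score)], by = loci]

# Comparing against max(Score) keeps every row that ties for the top score
all_ties <- dt[, .SD[Score == max(Score)], by = loci]

print(first_only)  # 2 rows
print(all_ties)    # 3 rows: both tied genes in locus 2 are kept
```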

score 1 · Answer 3 · answered Oct 22 '20 at 13:10

A base R option using subset

subset(dt, ave(Score, loci, FUN = max) == Score)

giving

   loci  Gene     Score
1:    1 AQP11 0.5566507
2:    2 CHST6 0.5395135

Another base R option using aggregate

aggregate(. ~ loci, dt[with(dt, order(-Score, loci)), ], head, 1)

giving

  loci  Gene             Score
1    1 AQP11 0.556650698184967
2    2 CHST6 0.539513528347015
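
If exactly one row per locus is wanted and Score should stay numeric, a further base R sketch is to sort by locus and descending score, then keep the first row seen for each locus. This is an alternative and not one of the options above; it assumes the question's data is rebuilt as a plain data frame, here given the hypothetical name dat.

```r
# Question's data rebuilt as a plain data frame (rounded scores from the question)
dat <- data.frame(
  loci  = c(1L, 1L, 1L, 2L, 2L),
  Gene  = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.5395135),
  stringsAsFactors = FALSE
)

# Sort by locus, then by descending score within each locus
ranked <- dat[order(dat$loci, -dat$Score), ]

# duplicated() flags every repeat of a locus, so negating it keeps the first
# (highest-scoring) row of each locus
top_one <- ranked[!duplicated(ranked$loci), ]
rownames(top_one) <- NULL

print(top_one)  # genes kept: AQP11 (locus 1), CHST6 (locus 2)
```

Unlike the subset option, this drops tied genes after the first, so it only fits when a single row per locus is acceptable.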
