R - subset by group

Question

I have the following data.frame :

Name    Expr    FC  FDR
probe1  gene1   1   1.5 0.005
probe2  gene2   1   1.3 0.02
probe3  gene1   2   1.2 0.000003
probe4  gene4   3   1.8 0.5
probe5  gene5   4   1.9 0.00008
probe6  gene2   5   1.3 0.03
probe7  gene4   3   1.2 0.0001
probe8  gene8   10  1.3 0.01
probe9  gene8   11  1.8 0.01

There are many probes representing the same gene. I would like to select only 1 probe per gene based on the FDR, ie, select for each gene, the probe with smallest FDR. If 2 probes for the same gene have the same FDR, select one randomly. I would like to have the following data.frame as output :

Name    Expr    FC  FDR
probe3  gene1   2   1.2 0.000003
probe4  gene5   4   1.9 0.00008
probe6  gene4   3   1.2 0.0001
probe2  gene2   1   1.3 0.02
probe8  gene8   10  1.3 0.01

Thanks in advance for your help !

what have you tried so far? Where are you stuck? Please add your tries to the question. — Andre Elrico, Aug 10 '18 at 11:08
Seems you have 4 column names and 5 columns. Or FDR has two values? — AntoniosK, Aug 10 '18 at 11:08
Some useful info here: https://stackoverflow.com/questions/24237399/how-to-select-the-rows-with-maximum-values-in-each-group-with-dplyr — AntoniosK, Aug 10 '18 at 11:15
I believe non of the solutions are right: `"If 2 probes for the same gene have the same FDR, select one randomly"`. This is not covered I believe. — Andre Elrico, Aug 10 '18 at 12:44

Lennyy · Answer 1 · 2018-08-10T13:22:56.210

3

With dplyr you could do:

df <- read.table(text = "
Name    Expr    FC   FDR
probe1  gene1   1   1.5 0.005
probe2  gene2   1   1.3 0.02
probe3  gene1   2   1.2 0.000003
probe4  gene4   3   1.8 0.5
probe5  gene5   4   1.9 0.00008
probe6  gene2   5   1.3 0.03
probe7  gene4   3   1.2 0.0001
probe8  gene8   10  1.3 0.01
probe9  gene8   11  1.8 0.01", h= T)

library(dplyr)
df %>% 
  group_by(Name) %>% 
  slice(which.min(FDR))

UPDATE after Andre Elrico's comment:

You could do this if you want to sample 1 row in case of ties for minimum FDRs.

df %>% 
  group_by(Name) %>% 
  filter(FDR == min(FDR)) %>% 
  sample_n(1)

edited Aug 10 '18 at 13:22

answered Aug 10 '18 at 11:14

Lennyy

5,932
2
10
23

1

+1. You could throw in `tibble::rownames_to_column(.,"probe")` or `mutate(.,probe = rownames(.))` as a first step as you're losing this info. Also you could use `top_n(-1,FDR)` in place of your `filter` step. – moodymudskipper Aug 10 '18 at 15:16

Maurits Evers · Answer 2 · 2018-08-10T11:34:47.200

2

(I think the down-votes are a bit harsh here: OP provides sample data and a clear problem statement.)

There are many ways to achieve this, I encourage you to spend some time researching similar questions here on SO. You will learn a lot about aggregating/summarising data in general.

Here is a base R possibility using split

do.call(rbind, lapply(split(df, df$Name), function(df) df[which.min(df$FDR), ]))
#       Name Expr  FC   FDR
#gene1 gene1    2 1.2 3e-06
#gene2 gene2    1 1.3 2e-02
#gene4 gene4    3 1.2 1e-04
#gene5 gene5    4 1.9 8e-05
#gene8 gene8   10 1.3 1e-02

or using by (thanks @RuiBarradas)

do.call(rbind, by(df, df$Name, function(x) x[which.min(x$FDR), ]))
   #    Name Expr  FC   FDR
#gene1 gene1    2 1.2 3e-06
#gene2 gene2    1 1.3 2e-02
#gene4 gene4    3 1.2 1e-04
#gene5 gene5    4 1.9 8e-05
#gene8 gene8   10 1.3 1e-02

Sample data

df <- read.table(text =
    "Name    Expr    FC  FDR
probe1  gene1   1   1.5 0.005
probe2  gene2   1   1.3 0.02
probe3  gene1   2   1.2 0.000003
probe4  gene4   3   1.8 0.5
probe5  gene5   4   1.9 0.00008
probe6  gene2   5   1.3 0.03
probe7  gene4   3   1.2 0.0001
probe8  gene8   10  1.3 0.01
probe9  gene8   11  1.8 0.01", header = T)

edited Aug 10 '18 at 11:34

answered Aug 10 '18 at 11:13

Maurits Evers

49,617
4
47
68

1

Instead of `lapply/split` it's simpler to use `by`. Upvote. – Rui Barradas Aug 10 '18 at 11:24
1

But to my surprise `lapply/split` is a bit faster, both with the small example dataframe in the question and with one with 100K rows. – Rui Barradas Aug 10 '18 at 11:30
You mean to split by `Expr`,no? – s_baldur Aug 10 '18 at 11:30
@snoram No, I believe the `probe`s are row names. – Rui Barradas Aug 10 '18 at 11:31
@snoram Yes, I agree with Rui. That's how I interpreted OPs question as well. – Maurits Evers Aug 10 '18 at 11:32
@snoram It does. The only difference is the order. – Maurits Evers Aug 10 '18 at 11:35
1

My mistake. `dput()` seems on the way out. – s_baldur Aug 10 '18 at 11:36
@RuiBarradas Oh I always forget about `by`. Nice, I like `by`. Made an edit. Thanks for reminding. – Maurits Evers Aug 10 '18 at 11:36
@RuiBarradas probes are not row names, they are kind of unique ID and could be multiple for the same gene. I want to obtaine 1 probe per gene (the most significant one, ie, the one with smaller FDR) – PIFASTE Aug 10 '18 at 14:05
@PIFASTE What causes the confusion is that you show 4 column names and 5 columns. You should share an example of your data with dput(head(df)) or similar. – s_baldur Aug 10 '18 at 16:31

s_baldur · Answer 3 · 2018-08-10T11:43:39.160

A data.table solution:

library(data.table)
setDT(df)

df[order(FDR), .SD[1], Name]
    Expr   Name FC FC2   FDR
1: gene1 probe3  2 1.2 3e-06
2: gene5 probe5  4 1.9 8e-05
3: gene4 probe7  3 1.2 1e-04
4: gene8 probe8 10 1.3 1e-02
5: gene2 probe2  1 1.3 2e-02

Where:

df <- data.frame(
  Name = paste0("gene", c(1, 2, 4, 5, 8))[c(1, 2, 1, 3, 4, 2, 3, 5, 5)], 
  Expr = c(1L, 1L, 2L, 3L, 4L, 5L, 3L, 10L, 11L), 
  FC = c(1.5, 1.3, 1.2, 1.8, 1.9, 1.3, 1.2, 1.3, 1.8), 
  FDR = c(0.005, 0.02, 3e-06, 0.5, 8e-05, 0.03, 1e-04, 0.01, 0.01)
)

score 0 · Answer 4 · answered Aug 10 '18 at 13:44

0

Here is an option with ave from base R. This will also give the rows that have ties

df[with(df, ave(FDR, Name, FUN = min) == FDR),]

answered Aug 10 '18 at 13:44

akrun

874,273
37
540
662

R - subset by group

4 Answers4

Sample data