1

I have the following data.frame :

Name    Expr    FC  FDR
probe1  gene1   1   1.5 0.005
probe2  gene2   1   1.3 0.02
probe3  gene1   2   1.2 0.000003
probe4  gene4   3   1.8 0.5
probe5  gene5   4   1.9 0.00008
probe6  gene2   5   1.3 0.03
probe7  gene4   3   1.2 0.0001
probe8  gene8   10  1.3 0.01
probe9  gene8   11  1.8 0.01

There are many probes representing the same gene. I would like to select only 1 probe per gene based on the FDR, ie, select for each gene, the probe with smallest FDR. If 2 probes for the same gene have the same FDR, select one randomly. I would like to have the following data.frame as output :

Name    Expr    FC  FDR
probe3  gene1   2   1.2 0.000003
probe4  gene5   4   1.9 0.00008
probe6  gene4   3   1.2 0.0001
probe2  gene2   1   1.3 0.02
probe8  gene8   10  1.3 0.01

Thanks in advance for your help !

PIFASTE
  • 87
  • 2
  • 8

4 Answers4

3

With dplyr you could do:

df <- read.table(text = "
Name    Expr    FC   FDR
probe1  gene1   1   1.5 0.005
probe2  gene2   1   1.3 0.02
probe3  gene1   2   1.2 0.000003
probe4  gene4   3   1.8 0.5
probe5  gene5   4   1.9 0.00008
probe6  gene2   5   1.3 0.03
probe7  gene4   3   1.2 0.0001
probe8  gene8   10  1.3 0.01
probe9  gene8   11  1.8 0.01", h= T)

library(dplyr)
df %>% 
  group_by(Name) %>% 
  slice(which.min(FDR))

UPDATE after Andre Elrico's comment:

You could do this if you want to sample 1 row in case of ties for minimum FDRs.

df %>% 
  group_by(Name) %>% 
  filter(FDR == min(FDR)) %>% 
  sample_n(1)
Lennyy
  • 5,932
  • 2
  • 10
  • 23
  • 1
    +1. You could throw in `tibble::rownames_to_column(.,"probe")` or `mutate(.,probe = rownames(.))` as a first step as you're losing this info. Also you could use `top_n(-1,FDR)` in place of your `filter` step. – moodymudskipper Aug 10 '18 at 15:16
2

(I think the down-votes are a bit harsh here: OP provides sample data and a clear problem statement.)

There are many ways to achieve this, I encourage you to spend some time researching similar questions here on SO. You will learn a lot about aggregating/summarising data in general.

Here is a base R possibility using split

do.call(rbind, lapply(split(df, df$Name), function(df) df[which.min(df$FDR), ]))
#       Name Expr  FC   FDR
#gene1 gene1    2 1.2 3e-06
#gene2 gene2    1 1.3 2e-02
#gene4 gene4    3 1.2 1e-04
#gene5 gene5    4 1.9 8e-05
#gene8 gene8   10 1.3 1e-02

or using by (thanks @RuiBarradas)

do.call(rbind, by(df, df$Name, function(x) x[which.min(x$FDR), ]))
   #    Name Expr  FC   FDR
#gene1 gene1    2 1.2 3e-06
#gene2 gene2    1 1.3 2e-02
#gene4 gene4    3 1.2 1e-04
#gene5 gene5    4 1.9 8e-05
#gene8 gene8   10 1.3 1e-02

Sample data

df <- read.table(text =
    "Name    Expr    FC  FDR
probe1  gene1   1   1.5 0.005
probe2  gene2   1   1.3 0.02
probe3  gene1   2   1.2 0.000003
probe4  gene4   3   1.8 0.5
probe5  gene5   4   1.9 0.00008
probe6  gene2   5   1.3 0.03
probe7  gene4   3   1.2 0.0001
probe8  gene8   10  1.3 0.01
probe9  gene8   11  1.8 0.01", header = T)
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
1

A data.table solution:

library(data.table)
setDT(df)

df[order(FDR), .SD[1], Name]
    Expr   Name FC FC2   FDR
1: gene1 probe3  2 1.2 3e-06
2: gene5 probe5  4 1.9 8e-05
3: gene4 probe7  3 1.2 1e-04
4: gene8 probe8 10 1.3 1e-02
5: gene2 probe2  1 1.3 2e-02

Where:

df <- data.frame(
  Name = paste0("gene", c(1, 2, 4, 5, 8))[c(1, 2, 1, 3, 4, 2, 3, 5, 5)], 
  Expr = c(1L, 1L, 2L, 3L, 4L, 5L, 3L, 10L, 11L), 
  FC = c(1.5, 1.3, 1.2, 1.8, 1.9, 1.3, 1.2, 1.3, 1.8), 
  FDR = c(0.005, 0.02, 3e-06, 0.5, 8e-05, 0.03, 1e-04, 0.01, 0.01)
) 
s_baldur
  • 29,441
  • 4
  • 36
  • 69
0

Here is an option with ave from base R. This will also give the rows that have ties

df[with(df, ave(FDR, Name, FUN = min) == FDR),]
akrun
  • 874,273
  • 37
  • 540
  • 662