Filtering rows with the same ID based on the lowest p-value

Question

I have a data frame of peaks that I mapped to genomic regions, which gives me each peak and its respective gene. Depending on the distance, several peaks can be mapped to the same genomic region, so I end up with this:

     Peak        annotation         ENSEMBL log2FoldChange         padj UP_DOWN
Peak13361 Distal Intergenic ENSG00000000457       3.458416 1.429138e-03      UP
Peak13362 Distal Intergenic ENSG00000000457       2.208152 3.153138e-10      UP
Peak13356 Distal Intergenic ENSG00000000457      -2.092536 1.693891e-03    DOWN
Peak13329 Distal Intergenic ENSG00000000460       3.862953 2.713778e-05      UP
Peak13331 Distal Intergenic ENSG00000000460       2.535419 3.064567e-02      UP
 Peak2767          Promoter ENSG00000000938       2.664457 2.362797e-03      UP
 Peak2769 Distal Intergenic ENSG00000000938       1.588538 3.678620e-07      UP
 Peak2771 Distal Intergenic ENSG00000000938       1.818130 5.232734e-03      UP
 Peak2772 Distal Intergenic ENSG00000000938       1.800501 2.102107e-02      UP
Peak15396 Distal Intergenic ENSG00000000971       1.577753 1.045814e-02      UP

For example, from these first three peaks:

     Peak        annotation         ENSEMBL log2FoldChange         padj UP_DOWN
Peak13361 Distal Intergenic ENSG00000000457       3.458416 1.429138e-03      UP
Peak13362 Distal Intergenic ENSG00000000457       2.208152 3.153138e-10      UP
Peak13356 Distal Intergenic ENSG00000000457      -2.092536 1.693891e-03    DOWN

I would like to keep only the peak with the highest significance (the lowest padj):

  Peak13362 Distal Intergenic ENSG00000000457       2.208152 3.153138e-10      UP

This is the logic I have to follow: if one ENSEMBL ID has multiple peaks, I have to keep the one with the highest significance.

Any suggestion or help would be really appreciated.

Do you need `df %>% group_by(ENSEMBL) %>% slice(which.min(padj))` using `dplyr` ? — Ronak Shah, Jun 01 '21 at 12:18
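The one-liner from this comment can be tried on a few rows of the question's table. The sketch below assumes the `dplyr` package is installed; the data frame is rebuilt by hand from the first five rows shown above, and the name `best` is only illustrative.

```r
library(dplyr)

# First five rows of the table from the question
df <- data.frame(
  Peak = c("Peak13361", "Peak13362", "Peak13356", "Peak13329", "Peak13331"),
  annotation = "Distal Intergenic",
  ENSEMBL = c("ENSG00000000457", "ENSG00000000457", "ENSG00000000457",
              "ENSG00000000460", "ENSG00000000460"),
  log2FoldChange = c(3.458416, 2.208152, -2.092536, 3.862953, 2.535419),
  padj = c(1.429138e-03, 3.153138e-10, 1.693891e-03, 2.713778e-05, 3.064567e-02),
  UP_DOWN = c("UP", "UP", "DOWN", "UP", "UP")
)

# For each gene, keep the single row with the smallest adjusted p-value
best <- df %>%
  group_by(ENSEMBL) %>%
  slice(which.min(padj)) %>%
  ungroup()

best$Peak
# "Peak13362" "Peak13329"
```

Note that `which.min()` returns only the first minimum, so if two peaks of a gene share exactly the same padj, only one of them is kept.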

glagla · Accepted Answer · 2021-06-01T12:39:13.007

Can't test it without a minimal reproducible example, but something along these lines should work:

subsetting = function(x, df){
  df2 = subset(df, ENSEMBL == x) # rows belonging to one ENSEMBL ID
  df2 = subset(df2, padj == min(padj)) # keep the row(s) with the smallest padj
  return(df2)
}

# lapply returns one small data frame per gene; rbind stacks them back together
do.call(rbind, lapply(unique(df$ENSEMBL), subsetting, df = df))
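The same selection (lowest padj per ENSEMBL ID) can also be done in base R without a helper function, by sorting and dropping duplicates. This is only a sketch: the data frame is rebuilt by hand from the ten rows in the question, and the names `sorted` and `best` are illustrative.

```r
# Rows copied from the question's table
df <- data.frame(
  Peak = c("Peak13361", "Peak13362", "Peak13356", "Peak13329", "Peak13331",
           "Peak2767", "Peak2769", "Peak2771", "Peak2772", "Peak15396"),
  annotation = c(rep("Distal Intergenic", 5), "Promoter",
                 rep("Distal Intergenic", 4)),
  ENSEMBL = rep(c("ENSG00000000457", "ENSG00000000460",
                  "ENSG00000000938", "ENSG00000000971"),
                times = c(3, 2, 4, 1)),
  log2FoldChange = c(3.458416, 2.208152, -2.092536, 3.862953, 2.535419,
                     2.664457, 1.588538, 1.818130, 1.800501, 1.577753),
  padj = c(1.429138e-03, 3.153138e-10, 1.693891e-03, 2.713778e-05,
           3.064567e-02, 2.362797e-03, 3.678620e-07, 5.232734e-03,
           2.102107e-02, 1.045814e-02),
  UP_DOWN = c("UP", "UP", "DOWN", rep("UP", 7)),
  stringsAsFactors = FALSE
)

# Sort by padj so the most significant peak of each gene comes first,
# then keep only the first occurrence of every ENSEMBL ID
sorted <- df[order(df$padj), ]
best <- sorted[!duplicated(sorted$ENSEMBL), ]

# Restore a gene-wise ordering for readability
best <- best[order(best$ENSEMBL), ]
best[, c("Peak", "ENSEMBL", "padj")]
```

On these ten rows this keeps Peak13362, Peak13329, Peak2769 and Peak15396, one per gene. Unlike filtering with `padj == min(padj)`, this variant always returns exactly one row per gene, even when padj values are tied.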

edited Jun 01 '21 at 12:39

answered Jun 01 '21 at 12:16


"Can't test it without a minimal reproducible example": for some reason my dput() is not working properly; it gives the whole 40k lines instead of the first 10 rows – PesKchan Jun 01 '21 at 12:20
I never thought of creating a function. I tried to use it, but where do I put the data frame, in which argument? – PesKchan Jun 01 '21 at 12:32

You don't necessarily need to use df as an argument, but it's somewhat better to do so. I updated the answer – glagla Jun 01 '21 at 12:38
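On the dput() problem raised in the comments: one common cause is that text columns are stored as factors, so `head()` keeps every level of the full table and `dput()` prints all of them. Whether that is what happened here is an assumption, since the structure of the real data frame is not shown. The sketch below uses a made-up stand-in table (`big`) to illustrate the effect and how `droplevels()` removes the unused levels.

```r
# Hypothetical stand-in for the real 40k-row table: a factor column with many levels
big <- data.frame(Peak = factor(paste0("Peak", 1:40000)),
                  padj = runif(40000))

small <- head(big, 10)
nlevels(small$Peak)   # 40000: head() keeps the unused factor levels

small <- droplevels(small)
nlevels(small$Peak)   # 10

# Now the output is short enough to paste into a question
dput(small)
```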
