I have a data frame stored as a .csv file with 34,500 rows. The file lists the results of an RNA-seq analysis. The problem is that some genes have multiple results, and I need to keep one entry per gene: the one with the highest p-value. I have already reduced the data to two columns, "Gene symbol" and "p value".

How can I remove the rows that should be eliminated according to this rule? I will add a screenshot that shows the problem.

Thanks in advance.

RNF144A, TTTY14, TAS2R8, KIAA0355 and GCNT2 are examples of the problem.

NelsonGon
Melih O.
  • Please add your data with `dput`: use `dput(head(df, n))`, not **images**. Also include sample code and explain what your rule is. – NelsonGon Aug 05 '19 at 12:44
  • I could not write any code, so I did not add any. My rule: for genes that have multiple entries, the row with the highest p-value should remain and the other entries should be eliminated. – Melih O. Aug 05 '19 at 12:50
  • OK, add your comment to your post and add the data as suggested above, or make a dummy data set. Include a sample of the expected output too. – NelsonGon Aug 05 '19 at 12:51
  • Related, possible duplicate: https://stackoverflow.com/q/24070714/680068 – zx8754 Aug 05 '19 at 13:18

1 Answer

Assuming that the blanks ("") are repeat entries of the previous non-blank 'Gene', change the blanks to NA (`na_if`), use `fill` (from tidyr) to replace each NA with the previous non-NA value, and then, grouped by 'Gene', keep the row with the maximum 'pvalue'.

    library(dplyr)
    library(tidyr)
    df1 %>%
        mutate(Gene = na_if(Gene, "")) %>%
        fill(Gene) %>%
        group_by(Gene) %>%
        slice(which.max(pvalue))
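
To make the steps concrete, here is a minimal sketch on a made-up data set. The values in `df1` are hypothetical (only the gene names come from the question), and it assumes the columns are called `Gene` and `pvalue` and that `Gene` is a character column. Note that `fill()` is exported by tidyr, not dplyr, so both packages need to be installed and loaded.

```r
library(dplyr)
library(tidyr)  # fill() lives in tidyr, not dplyr

# Hypothetical toy data: a blank Gene cell repeats the gene named above it
df1 <- data.frame(
  Gene   = c("RNF144A", "", "TTTY14", "TAS2R8", "", ""),
  pvalue = c(0.01, 0.20, 0.05, 0.30, 0.02, 0.40),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  mutate(Gene = na_if(Gene, "")) %>%  # turn "" into NA
  fill(Gene) %>%                      # carry the last gene name downward
  group_by(Gene) %>%
  slice(which.max(pvalue)) %>%        # keep the row with the largest p-value
  ungroup()                           # drop the grouping for later steps

print(result)
```

Each gene should end up with exactly one row. If the smallest p-value is wanted instead, `which.min(pvalue)` would be the corresponding change.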
akrun
  • Thanks, the solution looks like what I am looking for, but when I tried it, R could not find the `fill` function. I installed the dplyr package successfully, but it still does not work. – Melih O. Aug 05 '19 at 13:24
  • @MelihO. You can just assign the result to the same object `df1 <- df1 %>% mutate(...` or to a new one `df2 <- df1 %>% mutate(..` – akrun Aug 05 '19 at 13:31