
I have a data.frame; the start of it is shown below:

        gene            snp     pval   best_snp    best_pval
1  ENSG00000007341  rs2932538  5.6007 rs17030613   10.0542
2  ENSG00000064419 rs10488631  7.7461  rs4728142   24.6101
3  ENSG00000064419 rs12531711  7.7449  rs4728142   24.6101
4  ENSG00000064419 rs12537284  4.5544  rs4728142   24.6101
5  ENSG00000064666  rs3764650 12.3401  rs3752246    5.4001
6  ENSG00000072682 rs10479002  5.0141 rs12521868   21.1550

As shown, in rows 2-4 the same gene is repeated. For genes that are repeated, I only want to keep the best_snp and best_pval values in the first row where the gene appears (row 2 here); for rows 3 and 4 I want to delete the best_snp and best_pval values, since they are the same as above.

If a gene is not repeated, then just leave it as it is.

Please keep in mind that the table is much larger than shown and the genes are repeated at random places.
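
For reference, here is a minimal sketch that reproduces the sample rows above in R (the object name z is just for illustration):

z <- data.frame(
  gene      = c("ENSG00000007341", "ENSG00000064419", "ENSG00000064419",
                "ENSG00000064419", "ENSG00000064666", "ENSG00000072682"),
  snp       = c("rs2932538", "rs10488631", "rs12531711", "rs12537284",
                "rs3764650", "rs10479002"),
  pval      = c(5.6007, 7.7461, 7.7449, 4.5544, 12.3401, 5.0141),
  best_snp  = c("rs17030613", "rs4728142", "rs4728142", "rs4728142",
                "rs3752246", "rs12521868"),
  best_pval = c(10.0542, 24.6101, 24.6101, 24.6101, 5.4001, 21.1550),
  stringsAsFactors = FALSE
)

The desired result for these rows would look like this (cleared cells shown as NA, though any placeholder would do):

        gene            snp     pval   best_snp    best_pval
1  ENSG00000007341  rs2932538  5.6007 rs17030613   10.0542
2  ENSG00000064419 rs10488631  7.7461  rs4728142   24.6101
3  ENSG00000064419 rs12531711  7.7449         NA        NA
4  ENSG00000064419 rs12537284  4.5544         NA        NA
5  ENSG00000064666  rs3764650 12.3401  rs3752246    5.4001
6  ENSG00000072682 rs10479002  5.0141 rs12521868   21.1550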

zfz

2 Answers


I'm assuming that by table you mean a data.frame. If so, and if z is your data.frame:

z[match(unique(z$best_snp), z$best_snp), ]  # keeps only the first row for each best_snp

Based on Arun's answer and the links to your other question, it sounds like you actually want to keep the rows but replace the duplicated values with something (like NA?), which could be done with:

z2 <- z
z2[duplicated(z2$best_snp), c("best_snp", "best_pval")] <- NA
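
One possible variation, checked against the z object sketched in the question: keying on gene rather than best_snp, which may be safer if two different genes ever happen to share the same best SNP.

z2 <- z
z2[duplicated(z2$gene), c("best_snp", "best_pval")] <- NA  # blank repeats within each gene
z2  # rows 3 and 4 now show NA for best_snp and best_pval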
Thomas
  • Hi, thanks, but this deletes the rows where the gene is repeated, so it deleted rows 3 & 4. I want to keep ALL the rows, but just delete the best_snp and best_pval values for the rows with the repeated gene. – zfz Jul 02 '13 at 09:57
  • See update. I've added a solution where they're replaced by `NA`...but that value could be anything. – Thomas Jul 02 '13 at 09:57

If df is your data.frame:

library(plyr)
ddply(df, .(gene), function(x) {
  x[-1, c("best_snp", "best_pval")] <- NA  # clear all but the first row within each gene
  return(x)
})
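
A quick check on the sample rows from the question (reusing the z object sketched there; names here are just for illustration):

library(plyr)
cleaned <- ddply(z, .(gene), function(x) {
  x[-1, c("best_snp", "best_pval")] <- NA
  x
})
cleaned  # rows 3 and 4 now have NA in best_snp and best_pval

One thing to keep in mind is that ddply returns its result ordered by the grouping variable, so the original row order is preserved only when the input is already sorted by gene (as it is in the sample).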
gd047
  • The assignment here is attempted for every group, which may not be efficient when there are many groups (of which only a few contain duplicated entries). – Arun Jul 03 '13 at 20:22