R: filter dataset based on unique columns

Question

Possible Duplicate:
R: Finding patterns across multiple columns- possibly duplicated()?

Dear all,

Here is a part of my dataset:

         name   chr     start      stop strand   alias 
60 uc003vqx.2  chr7 130835560 130891916      -   PODXL
61 uc003xlp.1  chr8  38387812  38445509      -     FLG
62 uc003xlu.1  chr8  38400008  38445509      -     FLG
63 uc003xlv.1  chr8  38400008  38445509      -     FLG
64 uc003xtz.1  chr8  61263976  61356508      -     CA8
65 uc003xua.1  chr8  61283183  61356508      -     CA8
66 uc010lwg.1  chr8  38387812  38445509      -     FLG
67 uc010lwh.1  chr8  38387812  38445509      -     FLG
68 uc010lwj.1  chr8  38387812  38445509      -     FLG

I would like to filter the dataset so that each combination of the start, stop and alias columns appears only once. The final result should look like this:

         name   chr     start      stop strand   alias 
60 uc003vqx.2  chr7 130835560 130891916      -   PODXL
61 uc003xlp.1  chr8  38387812  38445509      -     FLG
62 uc003xlu.1  chr8  38400008  38445509      -     FLG
64 uc003xtz.1  chr8  61263976  61356508      -     CA8
65 uc003xua.1  chr8  61283183  61356508      -     CA8
66 uc010lwg.1  chr8  38387812  38445509      -     FLG

Does anyone know if there is a solution for this? Thanks!

If I'm not mistaken, your desired result contains a duplicated row (i.e. 66 is the same as 61). – Andrie, May 19 '11 at 13:49
See also: http://stackoverflow.com/questions/2626567/collapsing-data-frame-by-selecing-one-row-per-group , or http://stackoverflow.com/questions/1769365/how-to-remove-partial-duplicates-from-a-data-frame , or even http://stackoverflow.com/questions/2183002/display-only-one-line-for-each-na-value . Using the search function of SO wouldn't hurt. – Joris Meys, May 19 '11 at 14:37

score 7 · Accepted Answer · answered May 19 '11 at 13:48

Use the duplicated function:

Replicate the data:

x <- "         name   chr     start      stop strand   alias 
60 uc003vqx.2  chr7 130835560 130891916      -   PODXL
61 uc003xlp.1  chr8  38387812  38445509      -     FLG
62 uc003xlu.1  chr8  38400008  38445509      -     FLG
63 uc003xlv.1  chr8  38400008  38445509      -     FLG
64 uc003xtz.1  chr8  61263976  61356508      -     CA8
65 uc003xua.1  chr8  61283183  61356508      -     CA8
66 uc010lwg.1  chr8  38387812  38445509      -     FLG
67 uc010lwh.1  chr8  38387812  38445509      -     FLG
68 uc010lwj.1  chr8  38387812  38445509      -     FLG"

dat <- read.table(textConnection(x), header=TRUE)

Remove duplicates:

dat[!duplicated(dat[, c("start", "stop", "alias")]), ]

         name  chr     start      stop strand alias
60 uc003vqx.2 chr7 130835560 130891916      - PODXL
61 uc003xlp.1 chr8  38387812  38445509      -   FLG
62 uc003xlu.1 chr8  38400008  38445509      -   FLG
64 uc003xtz.1 chr8  61263976  61356508      -   CA8
65 uc003xua.1 chr8  61283183  61356508      -   CA8
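
To see why this works, it can help to look at the logical vector that duplicated returns before it is negated. The following is only a sketch on the same example data: it reads the text with read.table's text argument rather than textConnection, and the names key_cols and is_dup are made up for illustration.

```r
# Rebuild the example data (same rows as above, whitespace compacted).
x <- "name chr start stop strand alias
60 uc003vqx.2 chr7 130835560 130891916 - PODXL
61 uc003xlp.1 chr8 38387812 38445509 - FLG
62 uc003xlu.1 chr8 38400008 38445509 - FLG
63 uc003xlv.1 chr8 38400008 38445509 - FLG
64 uc003xtz.1 chr8 61263976 61356508 - CA8
65 uc003xua.1 chr8 61283183 61356508 - CA8
66 uc010lwg.1 chr8 38387812 38445509 - FLG
67 uc010lwh.1 chr8 38387812 38445509 - FLG
68 uc010lwj.1 chr8 38387812 38445509 - FLG"
dat <- read.table(text = x, header = TRUE)

# Columns that together define a "unique" row (illustrative name).
key_cols <- c("start", "stop", "alias")

# TRUE for every row whose key already appeared in an earlier row.
is_dup <- duplicated(dat[, key_cols])

# Row names of the rows that get dropped.
rownames(dat)[is_dup]
# [1] "63" "66" "67" "68"

# Keep the first occurrence of each key.
first_kept <- dat[!is_dup, ]

# Keep the last occurrence of each key instead.
last_kept <- dat[!duplicated(dat[, key_cols], fromLast = TRUE), ]
```

Row 66 is flagged because it has the same start, stop and alias as row 61, which is why it cannot appear in the filtered result.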

I had used the duplicated function before, but I didn't know that this was also possible. Thanks! – Lisann, May 19 '11 at 13:53

score 1 · Answer 2 · answered May 19 '11 at 13:55

I think your example output is in error. Try:

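# dfrm is the question's data frame (called dat in the accepted answer).
# Build one key string per row, then keep the first row for each key.
dfrm$comb <- with(dfrm, paste(start, stop, alias, sep = "+"))
dfrm[!duplicated(dfrm$comb), 1:6]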
#---
         name  chr     start      stop strand alias
60 uc003vqx.2 chr7 130835560 130891916      - PODXL
61 uc003xlp.1 chr8  38387812  38445509      -   FLG
62 uc003xlu.1 chr8  38400008  38445509      -   FLG
64 uc003xtz.1 chr8  61263976  61356508      -   CA8
65 uc003xua.1 chr8  61283183  61356508      -   CA8

IRTFM


Although this is a practical solution (and one I've used in the dreaded Excel many times), it should be possible to construct hypothetical data where this won't work. Imagine, for example, a dataset where each column consists of a varying number of + symbols. – Andrie May 19 '11 at 14:03
Definitely. Your approach is much better. – IRTFM May 19 '11 at 19:49
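
Andrie's objection can be made concrete. The sketch below uses a small made-up data frame (the names toy, a, b and key are hypothetical, not from the question). Its two rows are different, but they paste to the same string when "+" is the separator, so the pasted key reports a duplicate that the column-wise check does not.

```r
# Two distinct rows whose values consist only of "+" symbols.
toy <- data.frame(a = c("+", "++"),
                  b = c("++", "+"),
                  stringsAsFactors = FALSE)

# Pasted key with "+" as separator: both rows collapse to the same string.
key <- with(toy, paste(a, b, sep = "+"))
key
# [1] "++++" "++++"

# The pasted key wrongly flags the second row as a duplicate ...
duplicated(key)
# [1] FALSE  TRUE

# ... while comparing the columns directly does not.
duplicated(toy)
# [1] FALSE FALSE
```

Choosing a separator that cannot occur in the data avoids the collision, but passing the columns themselves to duplicated sidesteps the question entirely.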
