Possible Duplicate:
R: Finding patterns across multiple columns- possibly duplicated()?

Dear all,

Here is a part of my dataset:

         name   chr     start      stop strand   alias 
60 uc003vqx.2  chr7 130835560 130891916      -   PODXL
61 uc003xlp.1  chr8  38387812  38445509      -     FLG
62 uc003xlu.1  chr8  38400008  38445509      -     FLG
63 uc003xlv.1  chr8  38400008  38445509      -     FLG
64 uc003xtz.1  chr8  61263976  61356508      -     CA8
65 uc003xua.1  chr8  61283183  61356508      -     CA8
66 uc010lwg.1  chr8  38387812  38445509      -     FLG
67 uc010lwh.1  chr8  38387812  38445509      -     FLG
68 uc010lwj.1  chr8  38387812  38445509      -     FLG

I would like to filter the dataset so that each combination of the start, stop and alias columns appears only once. The final result should look like this:

         name   chr     start      stop strand   alias 
60 uc003vqx.2  chr7 130835560 130891916      -   PODXL
61 uc003xlp.1  chr8  38387812  38445509      -     FLG
62 uc003xlu.1  chr8  38400008  38445509      -     FLG
64 uc003xtz.1  chr8  61263976  61356508      -     CA8
65 uc003xua.1  chr8  61283183  61356508      -     CA8
66 uc010lwg.1  chr8  38387812  38445509      -     FLG

Does anyone know if there is a solution for this? Thanks!

Lisann
  • If I'm not mistaken, your desired results contain a duplicated row (i.e. 66 has the same start, stop and alias as 61) – Andrie May 19 '11 at 13:49
  • also : http://stackoverflow.com/questions/2626567/collapsing-data-frame-by-selecing-one-row-per-group , or http://stackoverflow.com/questions/1769365/how-to-remove-partial-duplicates-from-a-data-frame , or even http://stackoverflow.com/questions/2183002/display-only-one-line-for-each-na-value Using the search function of SO wouldn't hurt. – Joris Meys May 19 '11 at 14:37

2 Answers


Use the duplicated function:

Replicate the data:

x <- "         name   chr     start      stop strand   alias 
60 uc003vqx.2  chr7 130835560 130891916      -   PODXL
61 uc003xlp.1  chr8  38387812  38445509      -     FLG
62 uc003xlu.1  chr8  38400008  38445509      -     FLG
63 uc003xlv.1  chr8  38400008  38445509      -     FLG
64 uc003xtz.1  chr8  61263976  61356508      -     CA8
65 uc003xua.1  chr8  61283183  61356508      -     CA8
66 uc010lwg.1  chr8  38387812  38445509      -     FLG
67 uc010lwh.1  chr8  38387812  38445509      -     FLG
68 uc010lwj.1  chr8  38387812  38445509      -     FLG"

dat <- read.table(textConnection(x), header=TRUE)

Remove duplicates:

dat[!duplicated(dat[, c("start", "stop", "alias")]), ]

         name  chr     start      stop strand alias
60 uc003vqx.2 chr7 130835560 130891916      - PODXL
61 uc003xlp.1 chr8  38387812  38445509      -   FLG
62 uc003xlu.1 chr8  38400008  38445509      -   FLG
64 uc003xtz.1 chr8  61263976  61356508      -   CA8
65 uc003xua.1 chr8  61283183  61356508      -   CA8
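
As a rough illustration of what the call above relies on: duplicated() returns one logical value per row and marks every repeat after the first occurrence of a key, so negating it keeps the first row of each group. The toy data frame and the variable names below are hypothetical, made up only for this sketch. The fromLast = TRUE variant, which keeps the last row of each group instead, is shown as an option and is not something the answer itself uses.

```r
# Hypothetical toy data, much smaller than the table in the question
toy <- data.frame(
  name  = c("a1", "a2", "a3", "a4"),
  start = c(10, 10, 20, 10),
  stop  = c(50, 50, 50, 50),
  alias = c("FLG", "FLG", "FLG", "CA8"),
  stringsAsFactors = FALSE
)

key <- c("start", "stop", "alias")

# TRUE for each row whose key columns repeat an earlier row
dup_first <- duplicated(toy[, key])

# Same idea, but scanning from the bottom: earlier copies are marked instead
dup_last <- duplicated(toy[, key], fromLast = TRUE)

first_kept <- toy[!dup_first, ]  # keeps a1, a3, a4
last_kept  <- toy[!dup_last, ]   # keeps a2, a3, a4
```

Rows a1 and a2 share the same start, stop and alias, so only one of them survives in either variant; a4 is kept because its alias differs.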
Andrie
  • I have used the duplicated function before, but I didn't know that this was also possible. Thanks! – Lisann May 19 '11 at 13:53

I think your example output is in error. Try:

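# dfrm is assumed to be the data frame shown in the question
# Build one key string per row from the three columns of interest
dfrm$comb <- with(dfrm, paste(start, stop, alias, sep = "+"))
# Keep the first row for each key; 1:6 drops the helper column again
dfrm[!duplicated(dfrm$comb), 1:6]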
#---
         name  chr     start      stop strand alias
60 uc003vqx.2 chr7 130835560 130891916      - PODXL
61 uc003xlp.1 chr8  38387812  38445509      -   FLG
62 uc003xlu.1 chr8  38400008  38445509      -   FLG
64 uc003xtz.1 chr8  61263976  61356508      -   CA8
65 uc003xua.1 chr8  61283183  61356508      -   CA8
IRTFM
  • Although this is a practical solution (and one I've used in the dreaded Excel many times), it should be possible to construct hypothetical data where this won't work. Imagine, for example, a dataset where each column consists of a varying number of + symbols. – Andrie May 19 '11 at 14:03
  • Definitely. Your approach is much better. – IRTFM May 19 '11 at 19:49
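
The failure case described in the comment above can be sketched with a tiny made-up data frame (the names tricky, pasted, dup_pasted and dup_cols are hypothetical). When the values themselves contain the separator, two different rows can paste to the same key string, so the pasted-key approach flags a row that is not really a duplicate, while calling duplicated() on the columns directly does not.

```r
# Hypothetical data: the two rows differ, but both involve the "+" separator
tricky <- data.frame(
  a = c("+", ""),
  b = c("", "+"),
  stringsAsFactors = FALSE
)

# Pasted keys collide: both rows become "++"
pasted <- with(tricky, paste(a, b, sep = "+"))
dup_pasted <- duplicated(pasted)

# Comparing the columns themselves keeps the rows distinct
dup_cols <- duplicated(tricky[, c("a", "b")])
```

Here dup_pasted marks the second row as a duplicate, whereas dup_cols marks neither row. With numeric coordinates and gene symbols like those in the question such a collision is unlikely, but the column-based call avoids the issue altogether.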