In genomics research, you often have strings that contain duplicate gene names. I would like to find an efficient way to keep only the unique gene names in a string. Here is an example that works. But is it possible to do this in one step, i.e., without having to split the entire string and then paste the unique elements back together?

genes <- c("GSTP1;GSTP1;APC")
a <- unlist(strsplit(genes, ";"))
paste(unique(a), collapse=";")
#[1] "GSTP1;APC"
milan
  • This just combines it into one line: `paste(unique(unlist(strsplit(genes, ";"))), collapse=";")`. – lmo Jul 05 '16 at 18:45
  • I have seen this one on stack: http://stackoverflow.com/questions/20283624/removing-duplicate-words-in-a-string-in-r – Eric Lecoutre Jul 05 '16 at 18:48
  • I will be really surprised if you'll find anything better, except maybe adding `fixed = TRUE` to `strsplit` for an efficiency gain. There is also `stringi::stri_unique`, which claims to be better suited for NLP than `base::unique` (but is much slower too). – David Arenburg Jul 05 '16 at 18:48
  • You can write yourself a function that does these two pieces... – Gregor Thomas Jul 05 '16 at 19:22
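
Following up on the last two comments, a minimal sketch of such a wrapper could look like the one below. The name `unique_genes` and its `sep` argument are made up for illustration and do not come from the thread. It assumes the separator is a literal string (hence `fixed = TRUE`) and handles each element of a character vector separately instead of pooling them with `unlist`.

```r
# Hypothetical helper (name and arguments are illustrative, not from the thread):
# drop repeated entries within each delimited string, keeping first occurrences.
unique_genes <- function(x, sep = ";") {
  # fixed = TRUE treats sep as a literal string rather than a regular expression
  parts <- strsplit(x, sep, fixed = TRUE)
  # Deduplicate each string on its own and paste the pieces back together
  vapply(parts, function(p) paste(unique(p), collapse = sep), character(1))
}

genes <- c("GSTP1;GSTP1;APC", "A;B;A")
unique_genes(genes)
#[1] "GSTP1;APC" "A;B"
```

The split and paste still happen, but they are hidden behind one call, and the result has one element per input string.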

2 Answers

An alternative is doing

unique(unlist(strsplit(genes, ";")))
#[1] "GSTP1" "APC"

Then this should give you the answer

paste(unique(unlist(strsplit(genes, ";"))), collapse = ";")
#[1] "GSTP1;APC"
  • Thanks, but I need to keep the unique gene names in the same string, separated by ';'. – milan Jul 06 '16 at 17:24
  • @milan look at the updated version; it gives you the exact output you want –  Jul 07 '16 at 07:26

Based on the example shown, perhaps

gsub("(\\w+);\\1", "\\1", genes)
#[1] "GSTP1;APC"
akrun
  • Thanks. It does work for this example, but it won't work if you have a slightly different string: c("A", "B", "A"). – milan Jul 06 '16 at 17:26
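
To illustrate the limitation raised in the comment above, here is a small check of the same pattern on a few inputs. The wrapper name `dedup_adjacent` and the inputs other than the original example are made up for illustration. The pattern only removes a name that is immediately followed by itself, and since it has no word boundary after the back-reference it can also match a prefix of the next name.

```r
# Hypothetical wrapper around the pattern from the answer above
dedup_adjacent <- function(x) gsub("(\\w+);\\1", "\\1", x)

dedup_adjacent("GSTP1;GSTP1;APC")  # adjacent duplicate is collapsed
#[1] "GSTP1;APC"
dedup_adjacent("A;B;A")            # non-adjacent duplicate is left in place
#[1] "A;B;A"
dedup_adjacent("AB;ABC")           # "AB" also matches the start of "ABC"
#[1] "ABC"
```

For inputs like the last two, the split/unique/paste approach is the safer choice.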