Keep only unique elements in a string in R

Question

In genomics research, you often have strings that contain duplicate gene names. I would like an efficient way to keep only the unique gene names in a string. The example below works, but is it possible to do this in one step, i.e., without splitting the entire string and then pasting the unique elements back together?

genes <- c("GSTP1;GSTP1;APC")
a <- unlist(strsplit(genes, ";"))
paste(unique(a), collapse=";")
[1] "GSTP1;APC"

This just combines it into one line: `paste(unique(unlist(strsplit(genes, ";"))), collapse=";")`. — lmo, Jul 05 '16 at 18:45
I have seen this one on stack: http://stackoverflow.com/questions/20283624/removing-duplicate-words-in-a-string-in-r — Eric Lecoutre, Jul 05 '16 at 18:48
I would be really surprised if you find anything better, except maybe adding `fixed = TRUE` to `strsplit` for an efficiency gain. There is also `stringi::stri_unique`, which claims to be better suited to NLP than `base::unique` (but is much slower too). — David Arenburg, Jul 05 '16 at 18:48
You can write yourself a function that does these two pieces... — Gregor Thomas, Jul 05 '16 at 19:22
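The suggestions in the comments above (wrap the two steps in a function, and pass `fixed = TRUE` to `strsplit`) could be combined roughly as follows. This is only a sketch: the name `unique_tokens` and its `sep` argument are hypothetical and do not come from any package. It assumes the input is a character vector without `NA` values, and it deduplicates each element separately.

```r
# Hypothetical helper: the name and arguments are illustrative only.
unique_tokens <- function(x, sep = ";") {
  # fixed = TRUE treats sep as a literal string instead of a regular expression
  parts <- strsplit(x, sep, fixed = TRUE)
  # Deduplicate each element on its own, then paste it back together
  vapply(parts, function(p) paste(unique(p), collapse = sep), character(1))
}

genes <- c("GSTP1;GSTP1;APC", "A;B;A")
unique_tokens(genes)
#[1] "GSTP1;APC" "A;B"
```

The split and paste still happen internally, but the call site becomes a single step and works on a whole vector of strings at once.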

score 1 · Answer 1 · 2016-07-07T07:26:25.627

An alternative is to first extract the unique names as a vector:

unique(unlist(strsplit(genes, ";")))
#[1] "GSTP1" "APC"

Then this should give you the answer

paste(unique(unlist(strsplit(genes, ";"))), collapse = ";")
#[1] "GSTP1;APC"

edited Jul 07 '16 at 07:26

answered Jul 06 '16 at 08:10

Thanks, but I need to keep the unique gene names in the same string, separated by ';'. – milan Jul 06 '16 at 17:24
@milan Look at the updated version; it gives you exactly the output you want. – Jul 07 '16 at 07:26

score 0 · Answer 2 · answered Jul 06 '16 at 03:38

Based on the example shown, perhaps

gsub("(\\w+);\\1", "\\1", genes)
#[1] "GSTP1;APC"

akrun

Thanks. It does work for this example, but it won't work if you have a slightly different string: c("A", "B", "A"). – milan Jul 06 '16 at 17:26
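To illustrate the limitation raised in this comment: the pattern only removes a duplicate that immediately follows its first occurrence, so a string such as "A;B;A" (an assumed semicolon-separated form of the example above) is left unchanged. As a further assumed test case, a name that is a prefix of the next name is merged by mistake, because the backreference matches only part of the second name. The split-based approach handles the first case.

```r
# Non-adjacent duplicates are not matched by the pattern, so nothing changes
gsub("(\\w+);\\1", "\\1", "A;B;A")
#[1] "A;B;A"

# The backreference matches the first five characters of "GSTP11",
# so two distinct names collapse into one
gsub("(\\w+);\\1", "\\1", "GSTP1;GSTP11")
#[1] "GSTP11"

# Splitting on the separator compares whole names instead
paste(unique(unlist(strsplit("A;B;A", ";", fixed = TRUE))), collapse = ";")
#[1] "A;B"
```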
