How to remove duplicate consecutive text in R separated by :

Question

The data set looks like this:

  id  agent    final_col
1  1   A:A         A
2  1   A:A         A
3  2     B         B
4  3     C         C
5  4 A:C:C       A:C
6  4 A:C:C       A:C
7  4 A:C:C       A:C

How can I remove the duplicate entries in agent to get a clean column like final_col in R?

Do you want to remove all duplicates, or only consecutive duplicates? Should A:C:A become A:C or stay A:C:A? In the second case, you could do `sapply(agent, function(x){y=unlist(strsplit(x,':'));paste(y[cumsum(rle(y)$lengths)],collapse=":")})` – Florian, Jan 09 '18 at 13:58
Remove all duplicates; whether they are consecutive or not is not relevant. I believe I need to reword the question. – Shoaibkhanz, Jan 09 '18 at 14:00

score 4 · Answer 1 · answered Jan 09 '18 at 13:47

Let's just generate a new column based on df$agent:

df$final_col <- sapply(df$agent, function(txt){ 
    paste(unique(unlist(strsplit(txt, ":"))), collapse=":")
})

For each element, we split on `:`, keep the unique tokens, and paste them back together.
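The same split, unique, paste idea can be sketched as a small reusable function. This is a minimal sketch, not the answer's own code: the name `dedupe_tokens` is hypothetical, and the `as.character()` call assumes the column may have been read in as a factor, which would otherwise make `strsplit` fail.

```r
# Hypothetical helper (not from the answer): drop repeated tokens in each
# colon-separated string, keeping the first occurrence of each token.
dedupe_tokens <- function(x, sep = ":") {
  x <- as.character(x)  # strsplit() needs character input, not a factor
  vapply(
    strsplit(x, sep, fixed = TRUE),
    function(tokens) paste(unique(tokens), collapse = sep),
    character(1)
  )
}

# Sample data mirroring the question; agent is deliberately stored as a factor
df <- data.frame(
  id = c(1, 1, 2, 3, 4, 4, 4),
  agent = factor(c("A:A", "A:A", "B", "C", "A:C:C", "A:C:C", "A:C:C"))
)
df$final_col <- dedupe_tokens(df$agent)
print(df$final_col)
```

Using `vapply` with `character(1)` rather than `sapply` is a design choice here: it guarantees a plain character vector with one string per row.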

storaged


Is `strsplit` from base R? For some reason I am having trouble reproducing this; it says "Error in strsplit: non-character argument". – Imran Ali Jan 09 '18 at 13:56
@ImranAli That is because `df$agent` is a factor; wrap it as `as.character(df$agent)`. – zx8754 Jan 09 '18 at 13:57

score 1 · Accepted Answer · answered Jan 09 '18 at 13:51

You can do this with gsub and a regular expression that collapses consecutive repeats:

gsub("\\b(\\w+)(\\:\\1)+\\b", "\\1", DAT$agent)
[1] "A"   "A"   "B"   "C"   "A:C" "A:C" "A:C"
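Because the comments raise the question of consecutive versus non-consecutive duplicates, it may help to compare the two approaches on an input such as A:C:A. The sketch below uses an illustrative vector `x` that is not part of the question's data; the variable names are assumptions as well.

```r
# Illustrative input (not from the question): the last string repeats "A"
# non-consecutively
x <- c("A:A", "A:C:C", "A:C:A")

# Regex from the answer, written without the unneeded escape on ':'.
# It collapses only back-to-back repeats of the same token.
consecutive_only <- gsub("\\b(\\w+)(:\\1)+\\b", "\\1", x)

# Split / unique / paste removes repeats wherever they occur.
all_duplicates <- vapply(
  strsplit(x, ":", fixed = TRUE),
  function(tokens) paste(unique(tokens), collapse = ":"),
  character(1)
)

print(consecutive_only)  # "A" "A:C" "A:C:A"
print(all_duplicates)    # "A" "A:C" "A:C"
```

On the question's sample data both approaches give the same result, since every repeat there is consecutive; they differ only on inputs like A:C:A.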

Your Data

DAT = read.table(text="  id  agent    final_col
1  1   A:A         A
2  1   A:A         A
3  2     B         B
4  3     C         C
5  4 A:C:C       A:C
6  4 A:C:C       A:C
7  4 A:C:C       A:C",
header=TRUE, stringsAsFactors=FALSE)
