The data set looks like this:

  id  agent    final_col
1  1   A:A         A
2  1   A:A         A
3  2     B         B
4  3     C         C
5  4 A:C:C       A:C
6  4 A:C:C       A:C
7  4 A:C:C       A:C

How can I remove the duplicate entries within each value of agent in R, to get a clean column like final_col?

Shoaibkhanz
  • Do you want to remove all duplicates, or only consecutive ones? Should A:C:A become A:C or stay A:C:A? If it should stay A:C:A (consecutive only), you could do `sapply(agent, function(x){y=unlist(strsplit(x,':'));paste(y[cumsum(rle(y)$lengths)],collapse=":")})` – Florian Jan 09 '18 at 13:58
  • Remove all duplicates; whether they are consecutive or not is not relevant. I believe I need to reword the question. – Shoaibkhanz Jan 09 '18 at 14:00
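To make the distinction in the comments concrete, here is a minimal sketch of the "consecutive only" behaviour using `rle`. The helper name `collapse_runs` and the extra input `"A:C:A"` are made up for illustration; they are not from the question.

```r
# Hypothetical input: two of the asker's values plus one non-consecutive repeat
x <- c("A:A", "A:C:C", "A:C:A")

# Hypothetical helper: keep one entry per run of equal neighbours
collapse_runs <- function(s) {
  parts <- strsplit(s, ":", fixed = TRUE)[[1]]
  # rle() groups equal neighbours into runs; $values has one entry per run
  paste(rle(parts)$values, collapse = ":")
}

runs_only <- vapply(x, collapse_runs, character(1), USE.NAMES = FALSE)
print(runs_only)
# "A"  "A:C"  "A:C:A"   (the trailing A survives because it is not adjacent to the first)
```

Since the asker wants every repeat gone, this is probably not the behaviour wanted here, but it shows why the question needed clarifying.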

2 Answers

Let's just generate a new column based on `df$agent`:

df$final_col <- sapply(df$agent, function(txt) {
    # split on ":", drop repeated parts, then rejoin with ":"
    paste(unique(unlist(strsplit(txt, ":"))), collapse=":")
})

For each element, we split on `:`, keep only the unique parts, and paste them back together.
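If it helps, the three steps can be traced on a single string. This is just an illustrative walk-through; the variable names and the second input `"A:C:A"` are my own, not from the question.

```r
txt <- "A:C:C"

parts  <- strsplit(txt, ":")[[1]]      # split:  "A" "C" "C"
uniq   <- unique(parts)                # unique: "A" "C" (first occurrence kept)
result <- paste(uniq, collapse = ":")  # rejoin: "A:C"
print(result)

# unique() also drops non-consecutive repeats, which matches what the asker wants
other <- paste(unique(strsplit("A:C:A", ":")[[1]]), collapse = ":")
print(other)  # "A:C"
```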

storaged
  • Is `strsplit` from base R? For some reason I am having trouble reproducing this; it says Error in strsplit: non-character argument. – Imran Ali Jan 09 '18 at 13:56
  • @ImranAli That's because `df$agent` is a factor; wrap it with `as.character(df$agent)`. – zx8754 Jan 09 '18 at 13:57
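A small sketch of the factor case from the comments, assuming a made-up three-row `df` built with `stringsAsFactors = TRUE` (in R before 4.0 that was the default for `data.frame` and `read.table`). The `USE.NAMES = FALSE` is my addition to keep the result unnamed.

```r
# Hypothetical data frame where agent ends up as a factor
df <- data.frame(agent = c("A:A", "B", "A:C:C"), stringsAsFactors = TRUE)
print(is.factor(df$agent))  # TRUE, so strsplit() would reject each element

# Convert to character first, then apply the same split/unique/paste idea
df$final_col <- sapply(as.character(df$agent), function(txt) {
  paste(unique(unlist(strsplit(txt, ":"))), collapse = ":")
}, USE.NAMES = FALSE)

print(df$final_col)  # "A" "B" "A:C"
```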

You can do this with `gsub` and a regular expression (note that this collapses consecutive repeats only):

gsub("\\b(\\w+)(\\:\\1)+\\b", "\\1", DAT$agent)
[1] "A"   "A"   "B"   "C"   "A:C" "A:C" "A:C"
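To see what the backreference does and where it stops helping, here is a hedged sketch on a few made-up inputs. I have added `perl = TRUE` and dropped the escape on the colon; both are my changes, not part of the answer as posted.

```r
# Hypothetical inputs, including a non-consecutive repeat and a multi-letter token
x <- c("A:A:A", "A:C:C", "A:C:A", "AB:AB")

# (\w+) captures a token; (:\1)+ matches one or more immediate repeats of it
collapsed <- gsub("\\b(\\w+)(:\\1)+\\b", "\\1", x, perl = TRUE)
print(collapsed)
# "A"  "A:C"  "A:C:A"  "AB"
```

The third value is left alone because the two `A`s are not adjacent, so if non-consecutive repeats can occur, the split/unique/paste approach is probably the safer choice.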

Your Data

DAT = read.table(text="  id  agent    final_col
1  1   A:A         A
2  1   A:A         A
3  2     B         B
4  3     C         C
5  4 A:C:C       A:C
6  4 A:C:C       A:C
7  4 A:C:C       A:C",
header=TRUE, stringsAsFactors=FALSE)
G5W