I am trying to remove duplicate characters from strings.

dput(test)
c("APAAAAAAAAAAAPAAPPAPAPAAAAAAAAAAAAAAAAAAAAAAAAPPAPAAAAAAPPAPAAAPAPAAAAP", 
"AAA", "P", "P", "A", "P", "P", "APPPPPA", "A", "P", "AA", "PP", 
"PPA", "P", "P", "A", "P", "APAP", "P", "PA")

I wrote a function to sort the characters within each string:

strSort <- function(x)
  sapply(lapply(strsplit(x, NULL), sort), paste, collapse="")

Then I use gsub to remove consecutive repeated characters:

gsub("(.)\\1{2,}", "\\1", strSort(test))

This gives the following output:

gsub("(.)\\1{2,}", "\\1", strSort(test))
 [1] "AP"   "A"    "P"    "P"    "A"    "P"    "P"    "AAP"  "A"    "P"    "AA"   "PP"   "APP"  "P"    "P"    "A"    "P"    "AAPP" "P"    "AP"

Each output string should contain at most one A and/or one P.

shoonya

3 Answers

After splitting each string with strsplit, we need to apply unique to the sorted characters before pasting them back together:

sapply(strsplit(test, ""), function(x) 
       paste(unique(sort(x)), collapse=""))
#[1] "AP" "A"  "P"  "P"  "A"  "P"  "P"  "AP" "A"  "P"  "A"  "P"  "AP" "P"  "P"  "A"  "P"  "AP" "P"  "AP"
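
As a side note on why the original gsub attempt leaves pairs such as "AA" untouched: the quantifier in `\\1{2,}` requires at least two further repeats of the captured character, so only runs of three or more are collapsed. A minimal sketch, assuming the strSort helper from the question and a small illustrative subset of the data, that uses `\\1+` instead:

```r
# strSort is the helper from the question: sort the characters of each string.
strSort <- function(x)
  sapply(lapply(strsplit(x, NULL), sort), paste, collapse = "")

x <- c("AA", "PPA", "APAP", "APPPPPA")  # illustrative subset of the data

# "\\1+" matches one or more repeats, so any run of two or more
# identical characters is collapsed to a single character.
collapsed <- gsub("(.)\\1+", "\\1", strSort(x))
collapsed
# [1] "A"  "AP" "AP" "AP"
```

This only works because the strings are sorted first, which puts identical characters next to each other.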
akrun

Using a regex, you can do:

gsub('(?:(.)(?=(.*)\\1))', '', test, perl = TRUE)

#[1] "AP" "A"  "P"  "P"  "A"  "P"  "P"  "PA" "A"  "P"  "A"  "P"  "PA"
#[14] "P"  "P"  "A"  "P"  "AP" "P"  "PA"

The regex has been taken from here.
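
To make the behaviour of this pattern easier to follow: the lookahead `(?=.*\\1)` succeeds whenever the captured character occurs again later in the string, so every occurrence except the last one is deleted. That is why some results come out as "PA" rather than "AP". The sketch below uses a simplified but, as far as I can tell, equivalent form of the pattern on a few illustrative strings, and contrasts it with a hypothetical first-occurrence variant built on unique:

```r
x <- c("APPPPPA", "APAP", "PPA")  # illustrative subset of the data

# Delete each character that appears again later in the string,
# so only the LAST occurrence of each character survives.
last_kept <- gsub("(.)(?=.*\\1)", "", x, perl = TRUE)
last_kept
# [1] "PA" "AP" "PA"

# For comparison: unique() without sorting keeps the FIRST occurrence.
first_kept <- sapply(strsplit(x, ""), function(ch) paste(unique(ch), collapse = ""))
first_kept
# [1] "AP" "AP" "PA"
```

If a consistent alphabetical order is needed, sorting is still required in either case.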

Ronak Shah

Here is another option using utf8ToInt + intToUtf8

> sapply(test, function(x) intToUtf8(sort(unique(utf8ToInt(x)))), USE.NAMES = FALSE)
 [1] "AP" "A"  "P"  "P"  "A"  "P"  "P"  "AP" "A"  "P"  "A"  "P"  "AP" "P"  "P" 
[16] "A"  "P"  "AP" "P"  "AP"
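
The intermediate steps can be sketched on a single string as follows. utf8ToInt converts a length-one string into its integer code points, so deduplicating and sorting happen on numbers (by code point rather than by locale collation); the variable names are only illustrative:

```r
x <- "PPA"

codes <- utf8ToInt(x)                   # code points: 80 80 65
result <- intToUtf8(sort(unique(codes)))  # drop duplicates, sort, convert back
result
# [1] "AP"
```

Because utf8ToInt only accepts a single string, the sapply wrapper in the answer is needed to process the whole vector.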
ThomasIsCoding