Removing duplicate words without space in a vector in R

Question

I can't figure out how to write a regular expression in R that matches only values in which a word or phrase is repeated back to back with no space in between, such as:

Country<-c("GhanaGhana","Bangladesh","Pakistan","Mongolia","India",
"Indonesia","UgandaUganda","ArmeniaArmenia","Sri LankaSri Lanka",
"U.S. Virgin IslandsU.S. Virgin Islands")

and transform into this:

Country<-c("Ghana","Bangladesh","Pakistan","Mongolia","India","Indonesia",
           "Uganda","Armenia","Sri Lanka","U.S. Virgin Islands")

The usual R functions for duplicates, like anyDuplicated() or unique(), do not help here because the duplication is inside each string rather than across elements. Is there a way to write a regular expression for this in R?

Another approach would be to write a function to check words with even nchar by splitting them in the middle and in case of duplicates only keep one of the two. — user12728748, Jun 23 '20 at 13:28
I agree with @user12728748 - this would not be easy to do with regular expressions. — Kylie R., Jun 23 '20 at 13:30

Gregor Thomas · Answer 1 · 2020-06-23T14:06:27.383

stringr::str_replace(Country, "^(.*)\\1$", "\\1")
# [1] "Ghana"               "Bangladesh"          "Pakistan"            "Mongolia"            "India"              
# [6] "Indonesia"           "Uganda"              "Armenia"             "Sri Lanka"           "U.S. Virgin Islands"

## or in base (using perl = TRUE for efficiency)
sub("^(.*)\\1$", "\\1", Country, perl = TRUE)
## same result

In general, the pattern "^(.*)\\1$" matches a string that consists entirely of one sequence repeated twice: (.*) creates a capturing group, and \\1 is a backreference to whatever that group matched (just learned that using \\1 in the pattern itself is possible, thanks to this answer). We replace the entire repeated string with the contents of the first capturing group.
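As a quick check of how the anchors behave, here is a minimal sketch that wraps the base R call in a helper. The name dedupe_doubled and the extra test value "GhanaGhanaX" are assumptions added for illustration; they are not part of the original answer.

```r
# Hypothetical helper wrapping the base R call shown above.
dedupe_doubled <- function(x) {
  # ^ and $ force the whole string to be exactly "X followed by X";
  # strings that do not have that shape are returned unchanged.
  sub("^(.*)\\1$", "\\1", x, perl = TRUE)
}

dedupe_doubled(c("GhanaGhana", "India", "Sri LankaSri Lanka", "GhanaGhanaX"))
# [1] "Ghana"       "India"       "Sri Lanka"   "GhanaGhanaX"
```

Because of the anchors, a value such as "GhanaGhanaX", which is only partly repeated, is left alone.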

If the strings are long, do not rely on the default TRE library. Use PCRE then, `sub("^(.*)\\1$", "\\1", Country, perl=TRUE)` — Wiktor Stribiżew, Jun 23 '20 at 14:00

score 4 · Answer 2 · answered Jun 23 '20 at 13:35

You can compare the first half of each string against the second half; if they are the same, cut the string in half:

DoubledStrings <- substring(Country, 1, nchar(Country)/2 ) == substring(Country, nchar(Country)/2+1, nchar(Country))
Country[DoubledStrings] <- substring(Country, 1, nchar(Country)/2 )[DoubledStrings]

> Country
 [1] "Ghana"               "Bangladesh"          "Pakistan"           
 [4] "Mongolia"            "India"               "Indonesia"          
 [7] "Uganda"              "Armenia"             "Sri Lanka"          
[10] "U.S. Virgin Islands"
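The same idea can be packaged as a reusable function. This is only a sketch under a few assumptions: the name halve_if_doubled is made up for illustration, and the explicit even-length and NA checks are additions that the two-line version above does not have.

```r
# Hypothetical helper: keep the first half when both halves are identical.
halve_if_doubled <- function(x) {
  n <- nchar(x)
  half <- n %/% 2L                       # integer division, no fractional index
  first <- substring(x, 1L, half)
  second <- substring(x, half + 1L, n)
  # Only non-missing, non-empty, even-length strings can be an exact doubling.
  doubled <- !is.na(x) & n > 0L & n %% 2L == 0L & first == second
  x[doubled] <- first[doubled]
  x
}

halve_if_doubled(c("GhanaGhana", "India", "Sri LankaSri Lanka", ""))
# [1] "Ghana"     "India"     "Sri Lanka" ""
```

Returning a new vector instead of modifying Country in place makes it easier to compare the result with the regex approach.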

If the strings are at all long, I wouldn't be surprised if this answer quickly becomes much more efficient than regex. — Gregor Thomas, Jun 23 '20 at 13:39
