
I can't figure out how to write a regular expression in R that would "match" only strings made of a word (or phrase) duplicated back to back with no space in between, such as:

Country<-c("GhanaGhana","Bangladesh","Pakistan","Mongolia","India",
"Indonesia","UgandaUganda","ArmeniaArmenia","Sri LankaSri Lanka",
"U.S. Virgin IslandsU.S. Virgin Islands") 

and transform them into this:

Country<-c("Ghana","Bangladesh","Pakistan","Mongolia","India","Indonesia",
           "Uganda","Armenia","Sri Lanka","U.S. Virgin Islands")

The usual R functions for duplicates, such as anyDuplicated() or unique(), do not work here because they compare whole vector elements rather than looking inside each string. Is there a way to write a regular expression for this in R?

DATAUNIRIO
  • Another approach would be to write a function that checks strings with an even nchar by splitting them in the middle and, in case of duplicates, keeping only one of the two halves. – user12728748 Jun 23 '20 at 13:28
  • I agree with @user12728748 - this would not be easy to do with regular expressions. – Kylie R. Jun 23 '20 at 13:30

2 Answers

stringr::str_replace(Country, "^(.*)\\1$", "\\1")
# [1] "Ghana"               "Bangladesh"          "Pakistan"            "Mongolia"            "India"              
# [6] "Indonesia"           "Uganda"              "Armenia"             "Sri Lanka"           "U.S. Virgin Islands"

## or in base (using perl = TRUE for efficiency)
sub("^(.*)\\1$", "\\1", Country, perl = TRUE)
## same result

In general, the pattern "^(.*)\\1$" matches a string that consists entirely of one sequence repeated twice: (.*) creates a capture group, and \\1 is a backreference to whatever that first group matched (I just learned that using \\1 in the pattern itself is possible, thanks to this answer). The anchors ^ and $ force the match to span the whole string, so we replace the entire doubled string with the first capture group.
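To see why the anchors matter, here is a small sketch on made-up inputs (the vector `x` is illustrative and not part of the question's data). It suggests that only strings that are exactly one sequence written twice are matched; a doubled word with a separating space or a trailing character is left alone.

```r
# Illustrative inputs, not from the question's data
x <- c("GhanaGhana", "Ghana Ghana", "GhanaGhanaX", "Pakistan")
pattern <- "^(.*)\\1$"

# Which strings are an exact doubling of some sequence?
is_doubled <- grepl(pattern, x, perl = TRUE)
is_doubled
# [1]  TRUE FALSE FALSE FALSE

# Only the exact doubling is collapsed; everything else is returned unchanged
result <- sub(pattern, "\\1", x, perl = TRUE)
result
# [1] "Ghana"       "Ghana Ghana" "GhanaGhanaX" "Pakistan"
```

Note that the empty string also matches this pattern (an empty group repeated twice), which is harmless here because the replacement is again the empty string.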

Gregor Thomas

You can compare the first half of each string against its second half; if they are the same, keep only the first half.

half <- nchar(Country) %/% 2  # integer midpoint of each string
DoubledStrings <- substring(Country, 1, half) == substring(Country, half + 1, nchar(Country))
Country[DoubledStrings] <- substring(Country, 1, half)[DoubledStrings]  # keep only the first half

> Country
 [1] "Ghana"               "Bangladesh"          "Pakistan"           
 [4] "Mongolia"            "India"               "Indonesia"          
 [7] "Uganda"              "Armenia"             "Sri Lanka"          
[10] "U.S. Virgin Islands"
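If this needs to be reused, the same half-comparison idea can be wrapped in a function. The sketch below is one possible packaging, not the answer's own code: the name `dedupe_doubled` is hypothetical, and the explicit even-length and NA checks are additions made here for safety.

```r
# Hypothetical helper: the name and the NA handling are assumptions
dedupe_doubled <- function(x) {
  n <- nchar(x)
  half <- n %/% 2                      # integer midpoint
  first <- substring(x, 1, half)
  second <- substring(x, half + 1, n)
  # A doubled string has even length and two identical halves; NA is skipped
  doubled <- !is.na(x) & n %% 2 == 0 & first == second
  x[doubled] <- first[doubled]
  x
}

Country <- c("GhanaGhana", "Bangladesh", "Pakistan", "Mongolia", "India",
             "Indonesia", "UgandaUganda", "ArmeniaArmenia", "Sri LankaSri Lanka",
             "U.S. Virgin IslandsU.S. Virgin Islands")

cleaned <- dedupe_doubled(Country)

# On this data the result agrees with the regex approach
identical(cleaned, sub("^(.*)\\1$", "\\1", Country, perl = TRUE))
# [1] TRUE
```

Unlike the in-place version, this returns a new vector and leaves `Country` untouched, which makes it easier to compare against other approaches.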
Daniel O
  • If the strings are at all long, I wouldn't be surprised if this answer quickly becomes much more efficient than regex. – Gregor Thomas Jun 23 '20 at 13:39