I'm trying to replace any run of a repeated letter with a single letter.
I use gsub for this and it works:

text <- c("This tree is veeeeery tall")
gsub("([a-zA-Z])\\1+", "\\1", text)
##[1] "This tre is very tal"

However, I need to make exceptions for some words, so that the result looks like this:

"This tree is very tall"

I tried the solution from a related question, but it doesn't work; the text comes back unchanged:

text <- c("This tree is veeeeery tall")
words2keep <- c("tree", "tall")
gsub(perl=T,paste0('(?!\\b',paste(collapse='\\b|\\b',words2keep),'\\b)\\b([a-zA-Z])\\1+\\b'),'\\1',text)
##[1] "This tree is veeeeery tall"

So, is there any way to do it?

M--
  • It would be easier to help you with a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output. Show the code you actually tried and say exactly what "doesn't work" means. – MrFlick Oct 26 '17 at 15:02
  • @MrFlick I edited the question; can you check it now? – Amani AlFarasani Oct 26 '17 at 15:27
  • It is quite easy with a PCRE regex. All you need is to match the exception words and skip them with [`\b(?:tree|tall)\b(*SKIP)(*F)|([a-zA-Z])\1+`](https://regex101.com/r/1HDyq1/1) regex. See the [**R demo**](https://ideone.com/jRX9mM). – Wiktor Stribiżew Oct 26 '17 at 17:07
  • @WiktorStribiżew This is awesome! First time learning this method of `(*SKIP)(*FAIL)`. Please make this an answer. – acylam Oct 26 '17 at 18:58
  • I see it was re-opened, I "converted" (not "moved") my comment to an answer. – Wiktor Stribiżew Oct 26 '17 at 19:34

2 Answers

With the PCRE engine (perl=TRUE), it is easy to add exceptions to a regex. All you need is an alternation operator separating two parts: the left part matches what we want to skip, and the right part matches what we actually want to process.

\b(?:tree|tall)\b(*SKIP)(*F)|([a-zA-Z])\1+

See the regex demo

Details

  • \b(?:tree|tall)\b(*SKIP)(*F) - a leading word boundary, the whole word tree or tall, a trailing word boundary, and the two PCRE verbs (*SKIP)(*F), which together make the regex engine discard the text matched so far and resume searching from the current position (the end of the skipped word)
  • | - or
  • ([a-zA-Z])\1+ - any ASCII letter captured into Group 1 and then one or more repetitions of the same letter (note that \p{L} with (*UCP) verb makes the pattern fully Unicode-aware)
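
To see what the two verbs change, one can list the substrings each pattern actually matches. This is only a minimal sketch in base R using the sample sentence from the question; the variable names (plain, skipping, plain_hits, skip_hits) are made up for illustration:

```r
text <- "This tree is veeeeery tall"

# Plain pattern: every run of a repeated ASCII letter is a match.
plain <- "([a-zA-Z])\\1+"
# Same pattern, but the exception words are matched first and then discarded.
skipping <- "\\b(?:tree|tall)\\b(*SKIP)(*F)|([a-zA-Z])\\1+"

plain_hits <- regmatches(text, gregexpr(plain, text, perl = TRUE))[[1]]
skip_hits  <- regmatches(text, gregexpr(skipping, text, perl = TRUE))[[1]]

length(plain_hits)  # 3: the runs inside "tree", "veeeeery" and "tall"
length(skip_hits)   # 1: only the run of e's inside "veeeeery"
```

Only the matches that survive are passed to the replacement, which is why tree and tall are left alone.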

To build the regex dynamically in R, you need to paste the exception word vector into the left part of the regex:

text <- c("This tree is veeeeery tall")
words2keep <- c("tree", "tall")
p <- paste0('\\b(?:',paste(collapse='|',words2keep),')\\b(*SKIP)(*F)|([A-Za-z])\\1+')
## OR: p <- paste0('(*UCP)\\b(?:',paste(collapse='|',words2keep),')\\b(*SKIP)(*F)|(\\p{L})\\1+')
p
## => [1] "\\b(?:tree|tall)\\b(*SKIP)(*F)|([A-Za-z])\\1+"
gsub(p, '\\1',text, perl=TRUE)
## => [1] "This tree is very tall"

See the R demo online
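
If the word list comes from elsewhere, it may help to wrap the pattern building in a function. The following is a sketch only: squeeze_letters is a hypothetical helper name, the \Q...\E quoting is added here on the assumption that some words might contain regex metacharacters, and the empty-list guard is there because an empty (?:) group would match at every word boundary.

```r
# Hypothetical helper: collapse runs of a repeated ASCII letter,
# except inside the whole words listed in `keep`.
squeeze_letters <- function(text, keep = character(0)) {
  run <- "([A-Za-z])\\1+"
  if (length(keep) == 0) {
    # No exceptions given: fall back to the plain pattern.
    return(gsub(run, "\\1", text, perl = TRUE))
  }
  # \Q...\E makes PCRE read each word literally (assumes no word contains "\E").
  alternatives <- paste0("\\Q", keep, "\\E", collapse = "|")
  pattern <- paste0("\\b(?:", alternatives, ")\\b(*SKIP)(*F)|", run)
  gsub(pattern, "\\1", text, perl = TRUE)
}

squeeze_letters("This tree is veeeeery tall", c("tree", "tall"))
## => [1] "This tree is very tall"
squeeze_letters("Trees are taller", c("tree", "tall"))
## => [1] "Tres are taler"
```

The second call shows that the exceptions are case-sensitive and apply to whole words only, so inflected forms would have to be listed explicitly.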

Wiktor Stribiżew

Here is a solution with str_replace_all from stringr:

text1 <- c("This tree is veeeeery tall")
text2 <- c("This tree is vaeeeeery tall")
text3 <- c("This tree is eeeeery tall")
words2keep <- c("tree", "tall")

library(stringr)
replace_func = function(string){
  str_replace_all(string, "(\\w)\\1+", "\\1")
}

names(words2keep) = replace_func(words2keep)


text_clean1 = replace_func(text1)
str_replace_all(text_clean1, words2keep)
# [1] "This tree is very tall"

text_clean2 = replace_func(text2)
str_replace_all(text_clean2, words2keep)
# [1] "This tree is vaery tall"

text_clean3 = replace_func(text3)
str_replace_all(text_clean3, words2keep)
# [1] "This tree is ery tall"

This solution first runs words2keep through the same str_replace_all call that text will go through, and makes the result the names of words2keep:

> words2keep
   tre    tal 
"tree" "tall"

The same str_replace_all is then applied to text to remove all repeating word characters:

> replace_func(text1)
[1] "This tre is very tal" 

Finally, the trick is to have a third str_replace_all that replaces the incorrectly modified words with the original words by supplying the named words2keep vector.
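
One thing to watch with the restore step: the squeezed forms (tre, tal) are used as patterns, so they can also match inside other words. The sketch below mirrors the idea with base R gsub instead of stringr so that it runs without extra packages; squeeze, restore and the whole_word switch are hypothetical names for this illustration, and the extended sample sentence is made up.

```r
words2keep <- c("tree", "tall")

# Collapse runs of the same word character (base R stand-in for replace_func).
squeeze <- function(x) gsub("(\\w)\\1+", "\\1", x, perl = TRUE)

# Put each kept word back in place of its squeezed form.
# whole_word = TRUE wraps the squeezed form in \b...\b.
restore <- function(x, keep, whole_word = TRUE) {
  for (w in keep) {
    pat <- squeeze(w)
    if (whole_word) pat <- paste0("\\b", pat, "\\b")
    x <- gsub(pat, w, x, perl = TRUE)
  }
  x
}

text <- "This tree is veeeeery tall, a trek on metal"
restore(squeeze(text), words2keep, whole_word = FALSE)
## [1] "This tree is very tall, a treek on metall"
restore(squeeze(text), words2keep, whole_word = TRUE)
## [1] "This tree is very tall, a trek on metal"
```

With word boundaries around the squeezed forms, trek and metal are left untouched. A word that is genuinely spelled tre or tal would still be turned into tree or tall, which this approach cannot distinguish.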

acylam