In R, I have a string that contains repeated groups of characters:

testString <- "Hi hi missing u lollol hahahahalol sillybilly haaaaa!"

I'm trying to use a gsub regex to replace repeated groups of characters within each word to produce the following output:

"Hi hi missing u lol halol sillybilly haaaaa!"

I've tried the following line but it isn't producing the right output:

gsub("[[:blank:]](.+?){2,}[[blank]]\\1",
replacement="\\1", testString, perl=TRUE)

What have I done wrong?

Sotos
SimonsSchus
  • Why `haaaaa`? Maybe `haaa`? Why is it expected to be unmodified? – Wiktor Stribiżew May 05 '17 at 18:12
  • I want to keep repeated single characters together (e.g. 'aa' is allowed). However, repeated multi-character groups are not allowed ('haha' becomes 'ha'; 'haahaa' becomes 'haa'). – SimonsSchus May 05 '17 at 18:13
  • You'll need a [backreference](http://www.regular-expressions.info/backref.html) – MichaelChirico May 05 '17 at 18:15
  • This gets pretty close... `gsub('((([A-Za-z]+)[^\\1]+)\\2+)', '\\3', testString)` – MichaelChirico May 05 '17 at 18:17
  • @WiktorStribiżew can you help diagnose what's wrong with my approach? Any chance to salvage it? – MichaelChirico May 05 '17 at 18:22
  • @MichaelChirico: It won't work because you cannot put a backreference into a bracket expression. – Wiktor Stribiżew May 05 '17 at 18:23
  • @WiktorStribiżew So what's happening there? Since it compiled, I guess `[^\\1]` was just treating `\` and `1` as literal characters? – MichaelChirico May 05 '17 at 18:24
  • @MichaelChirico: Yes, since you are using a TRE regex, `"[^\\1]"` matches any char except `\` and `1`. If you use a PCRE regex, it matches any char except the SOH char. See [this R demo](http://ideone.com/6jxNMY). However, this approach does not work even when fixed - see [this demo](https://regex101.com/r/SncsPY/3). – Wiktor Stribiżew May 05 '17 at 18:30

1 Answer

You can match runs of a single repeated word char and skip them, and then collapse any other consecutively repeated group of chars, with a solution like

x <- "Hi hi missing u lollol hahahahalol sillybilly haaaaa!"
gsub("(\\w)\\1+(*SKIP)(*F)|(\\w+?)\\2+", "\\2", x, perl=TRUE)

See the regex demo and an online R demo

Details:

  • (\\w)\\1+(*SKIP)(*F) - match and capture a word char (with (\\w), this can be adjusted) and then 1+ occurrences of this same char (with \\1+); the matched text is then discarded and the engine goes on to search for another match after the end of that match (with the PCRE (*SKIP)(*FAIL) verbs sequence)
  • | - or
  • (\\w+?)\\2+ - 1 or more word chars, as few as possible, are captured into Group 2 (with (\\w+?)) and then 1+ occurrences of the same value are matched (with \\2+).

The replacement is just the Group 2 value.
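
To see what each branch contributes, here is a minimal sketch that applies the pattern above to individual words from the question and compares it with a variant that drops the (*SKIP)(*F) branch. The variant and the variable names (`words`, `with_skip`, `no_skip`) are illustrative assumptions, not part of the original answer.

```r
# Words taken from the question's test string
words <- c("lollol", "hahahahalol", "haaaaa", "missing", "sillybilly")

# Pattern from the answer: skip runs of one repeated char, collapse repeated groups.
# perl = TRUE is required because (*SKIP)(*F) are PCRE verbs.
pattern <- "(\\w)\\1+(*SKIP)(*F)|(\\w+?)\\2+"
with_skip <- gsub(pattern, "\\2", words, perl = TRUE)

# Illustrative variant without the skip branch (an assumption, for contrast only)
no_skip <- gsub("(\\w+?)\\1+", "\\1", words, perl = TRUE)

print(data.frame(word = words, with_skip = with_skip, no_skip = no_skip))
```

Without the skip branch, doubled letters are collapsed as well, so `missing` becomes `mising` and `haaaaa` becomes `ha`. With it, those single-character runs are consumed and discarded before the second alternative can touch them.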

Community
Wiktor Stribiżew
  • This was a really good learning example for me (for one, I had never seen *SKIP or *F before). I'm still working through this to check it out. – SimonsSchus May 05 '17 at 18:30
  • @SimonsSchus: Why is the output not expected? It is `Hi hi missing u lol halol sillybilly haaaaa!`. – Wiktor Stribiżew May 05 '17 at 18:32
  • Because I'm a fool, and I was putting the wrong testString string in! I will reassess and try to understand. – SimonsSchus May 05 '17 at 18:33
  • BTW, Michael's solution, when fixed, [does not produce the required output](https://regex101.com/r/SncsPY/3). – Wiktor Stribiżew May 05 '17 at 18:34
  • This approach works, but I'm having trouble with legitimate words that contain repetition (e.g. banana). I could likely update the expression to require matches only when there are two or more repetitions (e.g. 'hahaha' becomes 'ha', but 'haha' is retained as 'haha'). – SimonsSchus May 05 '17 at 18:38
  • Applied to the code by @Wiktor Stribiżew, my 'banana' suggestion above becomes: `gsub("(\\w)\\1+(*SKIP)(*F)|(\\w+?)\\2{2,}", "\\2", testString, perl=TRUE)` – SimonsSchus May 05 '17 at 18:45
  • @SimonsSchus: You may add the words to the SKIP-FAIL part: `"(?:banana|(\\w)\\1+)(*SKIP)(*F)|(\\w+?)\\2+"`. But I also tested with `{2,}` at the beginning, before posting the answer. Adjust as needed. – Wiktor Stribiżew May 05 '17 at 19:06
  • Beware of "ratatat" and "logogogue" – MichaelChirico May 05 '17 at 19:26
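
The variants discussed in these comments can be compared side by side. The sketch below is illustrative only: it runs the answer's pattern, the `{2,}` variant, and the exception-list variant on the words mentioned in the thread. The variable names (`original`, `strict`, `protect`, `results`) are assumptions and do not come from the thread.

```r
words <- c("banana", "haha", "hahaha", "ratatat", "logogogue")

# Answer's pattern: a group followed by one or more copies of itself is collapsed
original <- "(\\w)\\1+(*SKIP)(*F)|(\\w+?)\\2+"
# Variant from the comments: the group must be followed by two or more copies
strict <- "(\\w)\\1+(*SKIP)(*F)|(\\w+?)\\2{2,}"
# Variant from the comments: listed words are protected via the skip branch
protect <- "(?:banana|(\\w)\\1+)(*SKIP)(*F)|(\\w+?)\\2+"

# One column per pattern, one row per input word
results <- sapply(
  c(original = original, strict = strict, protect = protect),
  function(p) gsub(p, "\\2", words, perl = TRUE)
)
rownames(results) <- words
print(results)
```

Both variants keep `banana` intact, but neither rescues `ratatat` or `logogogue`: each contains a two-letter group three times in a row, so they still collapse to `rat` and `logue` unless they are also added to the exception list.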