Say I have a string, like:

where is mummy where is daddy

I want to replace any set of repeating substrings with empty strings - so in this case the where and is elements would be removed and the resulting string would be:

mummy daddy

I was wondering if there was any single regex that could achieve this. The regex I tried (which doesn't work) looked like the following:

/(\w+)(?=.*)\1/gi

Where the first capture group is any set of word characters, the second is a positive look ahead to any set of characters (in order to prevent those characters from being included in the result) and then the \1 is a backreference to the first matched substring.

Any help would be great. Thanks in advance!

jonny
  • Repeated words are fixed or do you want to find out repeated words first and then replace them? – gurvinder372 Mar 21 '16 at 09:28
  • Maybe [`(\b\w+\b)(?=.*\1)`](https://regex101.com/r/nY3sO4/1)? – Wiktor Stribiżew Mar 21 '16 at 09:29
  • @gurvinder372 Find repeated words first and then replace them I suppose - I was wondering if there was a single regex which could achieve this – jonny Mar 21 '16 at 09:29
  • @WiktorStribiżew That matches the first `where` and `is` but not the second, I'd like to match them globally. Is that possible? – jonny Mar 21 '16 at 09:30
  • So the final goal is `mummy daddy`? – Wiktor Stribiżew Mar 21 '16 at 09:32
  • @WiktorStribiżew exactly. I'll update the question – jonny Mar 21 '16 at 09:32
  • try this in JS console in browser... should this be fine in case words are exactly "where is"? "where is mummy where is daddy".replace(/where\s+is/gi, "") – Anil Namde Mar 21 '16 at 09:36
  • Possible duplicate of [Regular Expression For Consecutive Duplicate Words](http://stackoverflow.com/questions/2823016/regular-expression-for-consecutive-duplicate-words) – RIYAJ KHAN Mar 21 '16 at 09:37
  • @AnilNamde It needs to be generic. Any set of duplicate words – jonny Mar 21 '16 at 09:37
  • @RIYAJKHAN It's not a duplicate - that regex only works for consecutive repeating strings, I'm looking for a regex which works over an entire string. – jonny Mar 21 '16 at 09:40

1 Answer

Your regex does not work because the \w+ is not restricted by word boundaries, and because the (?=.*) lookahead consumes no characters, so the \1 backreference has to match immediately after the captured text, which is almost never the case.
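
To see the failure concretely, here is a small sketch that runs the regex from the question against the sample string. The variable names are only illustrative. Because \1 must follow the captured text directly, the only matches are doubled letters inside words:

```javascript
var str = 'where is mummy where is daddy';

// The regex from the question: the lookahead (?=.*) consumes nothing,
// so \1 has to match directly after the captured text.
var broken = str.match(/(\w+)(?=.*)\1/gi);
console.log(broken); // ["mm", "dd"] - the doubled letters in "mummy" and "daddy"

// With word boundaries and the backreference moved inside the lookahead,
// the repeated whole words are found instead.
var fixed = str.match(/(\b\w+\b)(?=.*\b\1\b)/gi);
console.log(fixed); // ["where", "is"]
```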

You need to first get the words that are dupes, and then build a RegExp to match them all with optional whitespace (or punctuation, etc. - adjust the pattern later) and replace with an empty string:

var re = /(\b\w+\b)(?=.*\b\1\b)/gi;               // Match each whole word that occurs again later in the string
var str = 'where is mummy where is daddy';
var patts = str.match(re) || [];                  // Collect the repeated words; match() returns null when there are none
// Build the pattern for replacing all found words (skipped when nothing is repeated)
var res = patts.length ? str.replace(RegExp("\\s*\\b(?:" + patts.join("|") + ")\\b", "gi"), "") : str;
document.body.innerHTML = res;

The first pattern is (\b\w+\b)(?=.*\b\1\b):

  • (\b\w+\b) - match and capture into Group 1 a whole word consisting of [A-Za-z0-9_] characters
  • (?=.*\b\1\b) - make sure this value captured into Group 1 is repeated somewhere to the right of the current location (not necessarily right after the word). If the string is multiline, use [\s\S] instead of the dot. To make sure we match original and dupe words as whole words, \b word boundaries should be used around both \w+ and \1.
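
The effect of the word boundaries around \1 can be checked with a short sketch. The test string below is an assumed variation of the sample in which "my" also appears inside "mummy"; the variable names are illustrative:

```javascript
var str = 'where is my mummy where is daddy';

// Without \b around \1, "my" counts as repeated because it occurs inside "mummy".
var loose = str.match(/(\b\w+\b)(?=.*\1)/gi);
console.log(loose); // ["where", "is", "my"]

// With \b around \1, only genuinely repeated whole words are reported.
var strict = str.match(/(\b\w+\b)(?=.*\b\1\b)/gi);
console.log(strict); // ["where", "is"]
```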

The second pattern will look different each time, but in your current scenario, it will be /\s*\b(?:where|is)\b/gi:

  • \s* - zero or more whitespace characters
  • \b(?:where|is)\b - a whole word from the alternation group (?:...|...): either where or is (case-insensitive due to /i modifier).
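
Both steps can be wrapped in one reusable function. This is a minimal sketch, not part of the original code: the name removeRepeatedWords is hypothetical, and the trailing trim() is an added assumption, used because \s* only strips whitespace before a removed word and would otherwise leave a leading space in the result.

```javascript
// Hypothetical helper: removes every occurrence of any whole word
// that appears more than once in the input string.
function removeRepeatedWords(str) {
  // Step 1: collect the whole words that occur again later in the string.
  var dupes = str.match(/(\b\w+\b)(?=.*\b\1\b)/gi);
  if (!dupes) {
    return str; // match() returns null when no word is repeated
  }
  // Step 2: build an alternation of those words and remove each occurrence,
  // together with any whitespace directly before it. Since the words consist
  // only of \w characters, they need no regex escaping.
  var remover = new RegExp("\\s*\\b(?:" + dupes.join("|") + ")\\b", "gi");
  return str.replace(remover, "").trim();
}

console.log(removeRepeatedWords('where is mummy where is daddy'));    // "mummy daddy"
console.log(removeRepeatedWords('where is my mummy where is daddy')); // "my mummy daddy"
```

Note that the dot in the first pattern does not match line breaks, so for multiline input [\s\S] would be needed in place of the dot, as mentioned above.
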
Wiktor Stribiżew
  • Great answer! Playing with your code I stumbled across a surprising problem though. It seems the checks for the word boundaries are not part of the first capture group. Therefore if you use it on the string "where is my mummy where is daddy" the word "my" is also deleted because it appears in "mummy". To avoid false positives you have to add the checks for the word boundaries around the repetition of the first capture group again (var re = /(\b\w+\b)(?=.*\b\1\b)/gi). – Florian Sandro Völkl Mar 21 '16 at 10:45
  • Yes, you also need then the word boundaries around the backreference, otherwise, the dupe word check is not correct. I updated the answer to reflect that aspect. – Wiktor Stribiżew Mar 21 '16 at 10:51