I am working on a problem that removes duplicated words from a string. E.g.,

Input: Goodbye bye bye world world world

Output: Goodbye bye world

I found a working pattern in online resources, but I do not understand every part of it.

    String pattern = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";

Here is my understanding:

  1. the initial \\b matches a word boundary
  2. (\\w+) matches one or more word characters
  3. in this expression: (\\b\\W+\\b\\1\\b)*

    a. \\b matches a word boundary

    b. \\W+ matches one or more non-word characters

    c. \\b again matches a word boundary

    d. \\1 ??? I don't know what this is for, but the pattern won't work without it

    e. \\b again matches a word boundary

As you can see, my main confusion is about item 3, and especially \\1. Can anyone explain it more clearly?

Alfabravo
drdot
  • Hi. I always use regexr to test and try out regular expressions [click here](http://regexr.com/). If you hover the pointer over an expression, it shows messages explaining what is going on. – Gabriel Marques Jan 23 '17 at 19:20
  • @GabrielMarques, thanks for the link. However, neither my pattern nor the one written by anubhava works in this web editor. Is the syntax the same as Java regex? – drdot Jan 23 '17 at 19:33
  • Yes. Try replacing each double backslash '\\' with a single '\' and it will work. You use double backslashes because you are writing the expression in a Java string literal, where the backslash itself has to be escaped. – Gabriel Marques Jan 23 '17 at 19:52
  • @anubhava Yes, thank you! – drdot Jan 14 '19 at 16:29

1 Answer

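In Java, you can use a lookahead with a back-reference to remove every word that is followed, later on the same line, by another copy of itself. The \\1 is the back-reference: it matches the same text that capture group 1, (\\w+), has just matched.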

    final String regex = "\\b(\\w+)\\b\\s*(?=.*\\b\\1\\b)";
    final String input = "Goodbye bye bye world world world\n";

    final String result = input.replaceAll(regex, "");

It is important to use word boundaries here to avoid matching partial words.
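
To make the role of the back-reference concrete, here is a minimal sketch that runs the pattern from the question next to the lookahead pattern. The class name `DedupDemo` and the two helper methods are hypothetical names chosen for illustration. Replacing each match of the question's pattern with `$1` (the text of group 1) is one common way to use it, and is an assumption here, since the question does not show the replacement call.

```java
import java.util.regex.Pattern;

public class DedupDemo {
    // Pattern from the question: a word captured as group 1, followed by
    // zero or more repeats of "non-word separator + the same word".
    // \\1 only matches the exact text that group 1 captured.
    static final Pattern CONSECUTIVE =
            Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*");

    // Lookahead pattern: a word plus trailing whitespace, matched only if
    // the same whole word appears again later on the same line.
    static final Pattern ANY_LATER =
            Pattern.compile("\\b(\\w+)\\b\\s*(?=.*\\b\\1\\b)");

    // Collapses each run of adjacent repeats down to its first word.
    static String collapseConsecutive(String input) {
        return CONSECUTIVE.matcher(input).replaceAll("$1");
    }

    // Deletes every copy of a word except the last one on the line.
    static String dropEarlierCopies(String input) {
        return ANY_LATER.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "Goodbye bye bye world world world";
        System.out.println(collapseConsecutive(input)); // Goodbye bye world
        System.out.println(dropEarlierCopies(input));   // Goodbye bye world

        // The two patterns differ when the repeats are not adjacent.
        System.out.println(collapseConsecutive("a b a")); // a b a
        System.out.println(dropEarlierCopies("a b a"));   // b a
    }
}
```

Both patterns leave "Goodbye" alone even though it contains "bye", because the \\b on each side of \\1 only lets the back-reference match a whole word. Note that the lookahead version keeps the last copy of each word rather than the first, so the order of the remaining words can change when the duplicates are not adjacent.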

anubhava