
This is not homework. I'm just trying to learn and get better at regular expressions.

I'm trying to find 1 or more repeated words in a string. Actually, I'm trying to find 1 or more repeated words in a string and remove the repeats. I've looked at link1 and link2 and tried using their pattern(s) but they don't seem to work for me.

Here is what I have:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String pattern = "\\b(\\w+)\\b\\s+\\1\\b";
Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
// This is actually read from the console
String input = "Goodbye bye bye world world world";
Matcher m = p.matcher(input);
while (m.find())
{
    System.out.println("group: " + m.group() + " start: " + m.start() + " end: " + m.end() + " group(1): " + m.group(1));
    input = input.replaceAll(m.group(), m.group(1));
}
System.out.println(input);

And this is my output:
group: bye bye start: 8 end: 15 group(1): bye
group: world world start: 16 end: 27 group(1): world
Goodbye bye world world

What I'm expecting for the 2nd line of output is "group: world world world start: 16 end: 33".

So, to me, it seems like this is matching only the first repeated word. My understanding of the pattern is: \b - word boundary, \w+ - one or more of the word (I'm not sure if it's the word repeated WITHOUT a space, i.e. 'wordword', or one or more of the word repeated WITH a space, i.e. 'word word'), then \b\s+ - followed by any white space, \1 - the grouped word, and finally \b - white space again.

Can someone explain to me what's going on and what it should be?

Thanks!

Victor

1 Answer


You are mostly right in your understanding of the regex, except that the regex only checks for two words in a row, not two or more words in a row. (Also, the final \b is a word boundary, not white space.)

To check for two or more words, group the second part of your regex and put a plus after it, so that the "white space followed by the same word" part can occur one or more times, like this:

\\b(\\w+)\\b(\\s+\\1\\b)+
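
As a rough sketch of how that pattern could be used to remove the repeats in one pass: the class name `RepeatedWords` and the helper `removeRepeats` below are made up for illustration and are not from the question. It assumes the goal is to keep only the first copy of each run, which `Matcher.replaceAll("$1")` does by replacing the whole match with group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedWords {
    // The pattern from the answer: a word, then one or more "whitespace + same word" parts.
    private static final Pattern REPEATS =
            Pattern.compile("\\b(\\w+)\\b(\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collapses each run of a repeated word to its first copy.
    static String removeRepeats(String input) {
        return REPEATS.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "Goodbye bye bye world world world";
        Matcher m = REPEATS.matcher(input);
        while (m.find()) {
            // Each match now spans the whole run of repeats, not just the first pair.
            System.out.println("group: " + m.group() + " start: " + m.start() + " end: " + m.end());
        }
        System.out.println(removeRepeats(input));
    }
}
```

With the sample input this should report "bye bye" at 8 to 15 and "world world world" at 16 to 33, then print "Goodbye bye world". Doing the replacement on the matcher itself also avoids calling `input.replaceAll(...)` inside the `find()` loop, where the matcher is still scanning the original string.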
Nathan Bierema
  • AH! Thanks! But a follow up to my understanding is \w+ = 'wordword' or 'word word' or ...? – Victor Apr 21 '16 at 04:32
  • \w stands for any word character, so \w+ stands for one or more word characters next to each other. So \b\w+\b matches the first word, and \s+\1\b matches a copy of that first word after white space. So \w+ = 'word' (the first word) – Nathan Bierema Apr 21 '16 at 04:36
  • Yes, it's a shortcut for [a-zA-Z_0-9] – Nathan Bierema Apr 21 '16 at 04:44
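
The point made in these comments can be checked with a small sketch. The class `WordCharDemo` and the helper `isSingleWord` are hypothetical names used only for this illustration; the check relies on `Pattern.matches`, which requires the whole input to match.

```java
import java.util.regex.Pattern;

class WordCharDemo {
    // Hypothetical helper: true when the entire input is one run of \w characters.
    static boolean isSingleWord(String s) {
        return Pattern.matches("\\w+", s);
    }

    public static void main(String[] args) {
        System.out.println(isSingleWord("wordword"));  // one unbroken run of word characters
        System.out.println(isSingleWord("word word")); // contains a space, which \w does not match
        System.out.println(isSingleWord("word_42"));   // digits and underscore are word characters
    }
}
```

This should print true, false, true: \w+ on its own never spans a space, so 'word word' is only matched once the \s+\1 part of the pattern is involved.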