17

I have the following code:

public static void createTokens(){
    String test = "test is a word word word word big small";
    Matcher mtch = Pattern.compile("test is a (\\s*.+?\\s*) word (\\s*.+?\\s*)").matcher(test);
    while (mtch.find()){
        for (int i = 1; i <= mtch.groupCount(); i++){
            System.out.println(mtch.group(i));
        }
    }
}

And it produces the following output:

word
w

But I expected it to be:

word
word

Could somebody please explain why?

Divers

2 Answers

19

Because your patterns are non-greedy, they match as little text as possible while still producing a match.

Remove the ? from the second group, and you'll get:
word
word word big small

Matcher mtch = Pattern.compile("test is a (\\s*.+?\\s*) word (\\s*.+\\s*)").matcher(test);
theglauber
  • And now the second group is capturing too much instead of too little. Non-greediness is not the problem, and greediness is not the solution. – Alan Moore Jan 19 '12 at 18:41
  • You're correct, but IMHO the non-greediness of the second capturing group explains why it captures only "w". The first capturing group has to capture "word" because of the "word" literal following it. I don't know exactly what he's looking for, and he edited the question after I submitted my answer, so I can't supply a correct regexp. – theglauber Jan 19 '12 at 18:49
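
To see the two behaviours from this answer side by side, here is a minimal sketch. The class name LazyVsGreedy and the helper groups are made up for illustration and are not part of the original question; the helper simply collects the capture groups of the first match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyVsGreedy {
    // Hypothetical helper: captured groups of the first match, or an empty list if there is none.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String test = "test is a word word word word big small";

        // Lazy second group: nothing follows it in the pattern, so it stops
        // after the single character that .+? is required to consume.
        System.out.println(groups("test is a (\\s*.+?\\s*) word (\\s*.+?\\s*)", test));
        // prints [word, w]

        // Greedy second group: it runs to the end of the input instead.
        System.out.println(groups("test is a (\\s*.+?\\s*) word (\\s*.+\\s*)", test));
        // prints [word, word word big small]
    }
}
```

In both cases the first group is pinned to "word" by the literal " word " that follows it; only the second group, which has nothing after it, changes with the quantifier.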
3

\\s* matches any number of whitespace characters, including zero, so the single character w is enough to satisfy (\\s*.+?\\s*). To make sure it matches a word separated by spaces, try (\\s+.+?\\s+)

Garrett Hall
  • Trouble is, the regex is already consuming the space characters before and after the word, so now you're trying to consume them twice. – Alan Moore Jan 19 '12 at 18:46
  • All you would need to do is remove the space from the regex like ...`\\s+)word(\\s+`... – Daniel Gray Jul 05 '17 at 10:21
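
The points raised in these comments can be checked with a short sketch. The class WordCapture and the helper groups are hypothetical names, and the final \\S+ pattern is not proposed anywhere in the thread; it is only one possible way, assuming the goal is to capture a single whitespace-free word on each side of the literal, to get the output the question expected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCapture {
    // Hypothetical helper: captured groups of the first match, or an empty list if there is none.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String test = "test is a word word word word big small";

        // \\s+ inside the groups AND literal spaces outside them: the literal
        // spaces are consumed first, so \\s+ finds no whitespace and nothing matches.
        System.out.println(groups("test is a (\\s+.+?\\s+) word (\\s+.+?\\s+)", test).isEmpty());
        // prints true

        // Literal spaces removed: this matches, but each group includes the
        // whitespace around the word, so the captures would need trimming.
        for (String g : groups("test is a(\\s+.+?\\s+)word(\\s+.+?\\s+)", test)) {
            System.out.println("\"" + g + "\"");
        }

        // Assumption: one whitespace-free word per group is what is wanted.
        System.out.println(groups("test is a (\\S+) word (\\S+)", test));
        // prints [word, word]
    }
}
```

With \\S+ the groups cannot contain whitespace at all, so the literal spaces in the pattern do all the separating and neither group can shrink to a single letter or spill across several words.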