3

I am a beginner trying to learn how to use regular expressions in Java. By going through a few online tutorials, I wrote the following sample codes to teach myself how regex with grouping operates, but the results are somewhat unintuitive.

String pattern = "((frok |dork )*)(\\w+) is (\\d+) Foo";
System.out.println(test1.matches(pattern));
System.out.println(test1.replaceAll(pattern, "$1"));        
System.out.println(test1.replaceAll(pattern, "$3"));

With test1 = frok dork dumb is 10 Foo, I get $1 as frok dork, $3 as dumb, as expected.

However, with test1 = frok dork is 10 Foo, I expected the match to fail. Instead, I get $1 as frok and $3 as dork. Why does the dork match with \\w+ here instead of ((frok | dork )*) as in earlier case?

I did search here on SO, but these posts (Java regular expression with groups, Regular expressions, groups issue, Is there a way to use a list of string parameters with a regular expression (with groups) to construct a new string?, Regular expression with variable number of groups?) do not address this issue.

Community
  • 1
  • 1
Masked Man
  • 1
  • 7
  • 40
  • 80

5 Answers5

3

A picture is worth a thousand words!

So here are your groups—explaining which part of the regex matches what.

RegexBuddy Grops

You need a regex debugger

The best way to see how things match or fail is to use a regex debugger. I work in regex all the time, and I wouldn't do it without debugging tools.

On Windows, the Rolls of regex debuggers is RegexBuddy. It is developed by Jan Goyvaerts, the author of The Regex Cookbook and of several regex-based tools. Online, regex101 is quite good.

zx81
  • 41,100
  • 9
  • 89
  • 105
  • FYI added a gorgeous screenshot. :) – zx81 Jun 23 '14 at 06:01
  • Thanks, I wasn't aware of these tools. :) Is there a way to force a "greedy" matching in the first group? I am using this for input validation, where input is expected to be of type: " is Foo". I want the match to fail if it is of the type, " is Foo", that is, with the missing. The non-keyword may not be previously seen, so I cannot explicitly look for a set of non-keywords. – Masked Man Jun 23 '14 at 06:47
  • Hi, glad to hear the answer helps. Sorry for the delay. I am not understanding your follow-up questions, I need examples. :) – zx81 Jun 25 '14 at 02:38
2

However, with test1 = frok dork is 10 Foo, I expected the match to fail. Instead, I get $1 as frok and $3 as dork. Why does the dork match with \\w+ here instead of ((frok | dork )*) as in earlier case?

Your first sentence here is the answer to your question. ((frok |dork )*) wants to match as many occurrences of frok  or dork  as it can; but the overriding consideration is, it wants the regex match to succeed. If a given part of the regex has to match a little bit less in order to get the match as a whole to succeed, then so be it.

For more information, I suggest Googling for regex + backtracking, greedy, and nongreedy.

ruakh
  • 175,680
  • 26
  • 273
  • 307
  • I don't think greediness is really relevant here. The results should be the same whether the first group is greedy or not. – shmosel Jun 23 '14 at 06:26
  • Thanks, is there any way to *prevent* the overriding consideration? I want the match to fail if it is NOT of the form " is Foo", since I am using this for input validation. The "non-keyword" may not be previously seen, so I cannot explicitly look to match those, and hence have to use `\\w+`. – Masked Man Jun 23 '14 at 06:52
  • @shmosel: Right; that's why I classified it as "more information". Understanding greedy and nongreedy quantifiers deepens one's understanding of the principles at work here, and learning about them is the obvious next step. – ruakh Jun 23 '14 at 06:53
  • 1
    @Happy: Yes; Java regexes support a notion of *possessive quantifiers* which, once they match, will never give back just part of what they matched. That's expressed with an extra `+`: `?` becomes `?+`, `*` becomes `*+`, `+` becomes `++`. So in your case, you would write `String pattern = "((frok |dork )*+)(\\w+) is (\\d+) Foo";` – ruakh Jun 23 '14 at 06:56
  • @Happy: By the way, I notice that in some of your comments to other answers, you seem to be describing `*` as "nongreedy". That is incorrect. "Greedy" refers to the fact that `*` wants to match as many repetitions as it can (even though it's willing to give some up if necessary); the "nongreedy" counterpart is `*?`, which wants to match as *few* repetitions as it can (though it's willing to match a bit if necessary). – ruakh Jun 23 '14 at 07:00
  • Thanks for your explanation. I will read more about it as you suggest and improve my understanding. – Masked Man Jun 23 '14 at 07:03
1

((frok |dork )*)(\\w+) means there can be any number of froks or dorks followed by a single word. In the first test, frok and dork both matched the first group, and dumb matched the next. But in your second test, dork had to be counted as the single following word in order to match the pattern. Therefore, only frok could be counted in the initial group.

shmosel
  • 49,289
  • 6
  • 73
  • 138
1

There are marking groups (expression) and there are non marking groups (?:expression). You see the difference. A non marking group has ?: after opening parenthesis.

The string found by the expression of a marking group can be backreferenced with $1, $2, ... or \1, \2, ....

The OR expression in your example should be a non marking group as the inner parentheses are just for the OR expression which should be applied 0 or more times.

Mofi
  • 46,139
  • 17
  • 80
  • 143
  • Thanks for the explanation. From the tutorials, I couldn't fully understand what ?: was used for, but your answer makes it clear. Is there a way to force "greedy" matching in the first group? I want the regex match to fail if the first group of "keywords" is NOT followed by a "non-keyword". – Masked Man Jun 23 '14 at 06:49
  • `"((?:frok |dork )*)(\\w+)(?<!frok|dork) is (\\d+) Foo"` can be used to find `frok` or `dork` 0 or more times and additionally require one more word before `is` which is whether `frok` nor `dork`. The expression `(?<!frok|dork)` is a negative lookbehind which checks if the string found by `\w+` is whether `frok` nor `dork`. – Mofi Jun 23 '14 at 07:16
0

You have nested groups in the regex: ((frok |dork )*) so frok is captured twice. Then, as you have guessed, dork is captured by the (\\w+)

Nir Alfasi
  • 53,191
  • 11
  • 86
  • 129