29

Inspired by the question if {0} quantifier actually makes sense I started playing with some regexes containing {0} quantifier and wrote this small java program that just splits a test phrase based on various test regex:

private static final String TEST_STR =
    "Just a test-phrase!! 1.2.3.. @ {(t·e·s·t)}";

private static void test(final String pattern) {
    System.out.format("%-17s", "\"" + pattern + "\":");
    System.out.println(Arrays.toString(TEST_STR.split(pattern)));
}

public static void main(String[] args) { 
    test("");
    test("{0}");
    test(".{0}");
    test("([^.]{0})?+");
    test("(?!a){0}");
    test("(?!a).{0}");
    test("(?!.{0}).{0}");
    test(".{0}(?<!a)");
    test(".{0}(?<!.{0})");
} 

==> The output:

"":              [, J, u, s, t,  , a,  , t, e, s, t, -, p, h, r, a, s, e, !, !,  , 1, ., 2, ., 3, ., .,  , @,  , {, (, t, ·, e, ·, s, ·, t, ), }]
"{0}":           [, J, u, s, t,  , a,  , t, e, s, t, -, p, h, r, a, s, e, !, !,  , 1, ., 2, ., 3, ., .,  , @,  , {, (, t, ·, e, ·, s, ·, t, ), }]
".{0}":          [, J, u, s, t,  , a,  , t, e, s, t, -, p, h, r, a, s, e, !, !,  , 1, ., 2, ., 3, ., .,  , @,  , {, (, t, ·, e, ·, s, ·, t, ), }]
"([^.]{0})?+":   [, J, u, s, t,  , a,  , t, e, s, t, -, p, h, r, a, s, e, !, !,  , 1, ., 2, ., 3, ., .,  , @,  , {, (, t, ·, e, ·, s, ·, t, ), }]
"(?!a){0}":      [, J, u, s, t,  , a,  , t, e, s, t, -, p, h, r, a, s, e, !, !,  , 1, ., 2, ., 3, ., .,  , @,  , {, (, t, ·, e, ·, s, ·, t, ), }]
"(?!a).{0}":     [, J, u, s, t,  a,  , t, e, s, t, -, p, h, ra, s, e, !, !,  , 1, ., 2, ., 3, ., .,  , @,  , {, (, t, ·, e, ·, s, ·, t, ), }]
"(?!.{0}).{0}":  [Just a test-phrase!! 1.2.3.. @ {(t·e·s·t)}]
".{0}(?<!a)":    [, J, u, s, t,  , a , t, e, s, t, -, p, h, r, as, e, !, !,  , 1, ., 2, ., 3, ., .,  , @,  , {, (, t, ·, e, ·, s, ·, t, ), }]
".{0}(?<!.{0})": [Just a test-phrase!! 1.2.3.. @ {(t·e·s·t)}]

The following did not surprise me:

  1. "", ".{0}", and "([^.]{0})?+" just split before every character and that makes sense because of 0-quantifier.
  2. "(?!.{0}).{0}" and ".{0}(?<!.{0})" don't match anything. Makes sense to me: Negative Lookahead / Lookbehind for 0-quantified token won't match.

What did surprise me:

  1. "{0}" & "(?!a){0}": I actually expected an Exception here, because of preceding token not quantifiable: For {0} there is simply nothing preceding and for (?!a){0} not really just a negative lookahead. Both just match before every char, why? If I try that regex in a javascript validator, I get "not quantifiable error", see demo here! Is that regex handled differently in Java & Javascript?
  2. "(?!a).{0}" & ".{0}(?<!a)": A little surprise also here: Those match before every char of the phrase, except before/after the a. My understanding is that in (?!a).{0} the (?!a) Negative Lookahead part asserts that it is impossible to match the a literally, but I am looking ahead .{0}. I thought it would not work with 0-quantified token, but looks like I can use Lookahead with those too.

==> So the remaining mystery for me is why (?!a){0} is actually matching before every char in my test phrase. Shouldn't that actually be an invalid pattern and throw a PatternSyntaxException or something like that?


Update:

If I run the same Java code within an Android Activity the outcome is different! There the regex (?!a){0} indeed does throw an PatternSyntaxException, see:

03-20 22:43:31.941: D/AndroidRuntime(2799): Shutting down VM
03-20 22:43:31.950: E/AndroidRuntime(2799): FATAL EXCEPTION: main
03-20 22:43:31.950: E/AndroidRuntime(2799): java.lang.RuntimeException: Unable to start activity ComponentInfo{com.appham.courseraapp1/com.appham.courseraapp1.MainActivity}: java.util.regex.PatternSyntaxException: Syntax error in regexp pattern near index 6:
03-20 22:43:31.950: E/AndroidRuntime(2799): (?!a){0}
03-20 22:43:31.950: E/AndroidRuntime(2799):       ^
03-20 22:43:31.950: E/AndroidRuntime(2799):     at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2180)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2230)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at android.app.ActivityThread.access$600(ActivityThread.java:141)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1234)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at android.os.Handler.dispatchMessage(Handler.java:99)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at android.os.Looper.loop(Looper.java:137)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at android.app.ActivityThread.main(ActivityThread.java:5041)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at java.lang.reflect.Method.invokeNative(Native Method)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at java.lang.reflect.Method.invoke(Method.java:511)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:793)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:560)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at dalvik.system.NativeStart.main(Native Method)
03-20 22:43:31.950: E/AndroidRuntime(2799): Caused by: java.util.regex.PatternSyntaxException: Syntax error in regexp pattern near index 6:
03-20 22:43:31.950: E/AndroidRuntime(2799): (?!a){0}
03-20 22:43:31.950: E/AndroidRuntime(2799):       ^
03-20 22:43:31.950: E/AndroidRuntime(2799):     at java.util.regex.Pattern.compileImpl(Native Method)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at java.util.regex.Pattern.compile(Pattern.java:407)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at java.util.regex.Pattern.<init>(Pattern.java:390)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at java.util.regex.Pattern.compile(Pattern.java:381)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at java.lang.String.split(String.java:1832)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at java.lang.String.split(String.java:1813)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at com.appham.courseraapp1.MainActivity.onCreate(MainActivity.java:22)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at android.app.Activity.performCreate(Activity.java:5104)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1080)
03-20 22:43:31.950: E/AndroidRuntime(2799):     at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2144)
03-20 22:43:31.950: E/AndroidRuntime(2799):     ... 11 more

Why regex in Android behaves different than plain Java?

Community
  • 1
  • 1
donfuxx
  • 11,277
  • 6
  • 44
  • 76
  • 6
    Just fyi, you can test for java here: http://www.regexplanet.com/advanced/java/index.html – quazzieclodo Mar 04 '14 at 20:18
  • This seems almost to be better suited to the Code Golf stack exchange. – Mar Mar 04 '14 at 20:26
  • 1
    Lol, I didn't even know about this site :-) @MartinCarney – donfuxx Mar 04 '14 at 20:29
  • The link to Code Golf is http://codegolf.stackexchange.com/ . Stack Overflow (SO) is the original site, and has since been expanding into many subjects. It's kind of like our equivalent to a subreddit... – Mar Mar 04 '14 at 20:34
  • 2
    See also this question [How does {m}{n} work?](http://stackoverflow.com/a/18975475/7586) - I linked directly to my answer, if you don't mind, because I think it is relevant. Basically, Java ignored the redundant `{n}`, which is illegal in most other regex flavors (like in the example you've posted). – Kobi Mar 04 '14 at 20:34
  • 13
    This has nothing to do with code golf. This is about a deep understanding of how the java regex engine works. – Casimir et Hippolyte Mar 04 '14 at 20:58
  • 1
    It's not that surprising that some things are different in Android vs Java especially corner cases like this as Android Java is a reimplementation of Java, just as Linux is a reimplementation of Unix but has no direct access to the source. – GenericJam Mar 24 '14 at 00:04
  • 1
    Have you also compared it against OpenJDK, or if that's where you tested, have you tried Sun? Might be interesting to see if the test case spans multiple implementations. – avgvstvs Mar 26 '14 at 19:02
  • Well, it don't mean no a. – Chris Wesseling Mar 30 '14 at 21:45
  • If the regex `""` looks ok to you, then `{0}` should be ok too because it is just applying the quantifier `{0}` to the same regex as `""`. My 2 cents. It can't see any problem with `(?!a)}{0}` however because `(?!a)` is a regex. – Ludovic Kuty Apr 03 '14 at 13:03

1 Answers1

13

I did some looking into the source of oracles java 1.7.

"{0}"

I found some code that throws "Dangling meta character" when it finds ?, * or + in the main loop. That is, not immediately after some literal, group, "." or anywhere else where quantifiers are explicitly checked for. For some reason, { is not in that list. The result is that it falls through all checks for special characters and starts parsing for a literal string. The first character it encounters is {, which tells the parser it is time to stop parsing the literal string and check for quantifiers.

The result is that "{n}" will match empty string n times.

Another result is that a second "x{m}{n}" will first match x m times, then match empty string n times, effectively ignoring the {n}, as mentioned by @Kobi in the comments above.

Seems like a bug to me, but it wouldn't surprise me if they want to keep it for backwards compatibility.

"(?!a){0}"

"(?!a)" is just a node which is quantifiable. You can check if the next character is an 'a' 10 times. It will return the same result each time though, so it's not very useful. In our case, it will check if the next character is an 'a' 0 times, which will always succeed.

Note that as an optimization when a match has 0 length such as here, the quantifier is never greedy. This also prevents infinite recursion in the "(?!a)*" case.

"(?!a).{0}" & ".{0}(?<!a)"

As mentioned above, {0} performs a check 0 times, which always succeeds. It effectively ignores anything that comes before it. That means "(?!a).{0}" is the same as "(?!a)", which has the expected result.

Similar for the other one.

Android is different

As mentioned by @GenericJam, android is a different implementation and may have different characteristics in these edge cases. I tried looking at that source as well, but android actually uses native code there :)

Pelle Wielinga
  • 307
  • 2
  • 8
  • 1
    Android uses ICU regex, which is similar, but not the same as Java in term of supported constructs. – nhahtdh May 19 '15 at 08:26
  • It doesn't really matter how engines interpret quantifiers. The bottom line is quantifiers / repetition operators can't exist in a state machine when they don't quantify anything. And, they can't be interpreted as a standalone state. `For some reason, { is not in that list` - Are you saying a valid range quantifier is not parsed? That's not true. `{5}` is parsed somewhere as a repetition operator, otherwise it's a literal. There is no valid interpretation as a quantifier without valid preceding value. For that reason `x{3}{8}` is parsed as a nested quantifier, which has no meaning. –  May 07 '17 at 22:23
  • `This also prevents infinite recursion in the "(?!a)*" case.` All quantifiers on assertions resolve to 1 or 0. Has nothing to do with non-greedy or infinite loops. This can be tested with a benchmarked. `Regex1: (?!a){400000,100000000} Completed iterations: 50 / 50 ( x 1000 ) Matches found per iteration: 2 Elapsed Time: 0.26 s, 256.39 ms, 256387 µs Regex2: (?!a)+ Completed iterations: 50 / 50 ( x 1000 ) Matches found per iteration: 2 Elapsed Time: 0.25 s, 254.53 ms, 254529 µs` –  May 07 '17 at 22:41
  • Also seen here `Regex1: (?!a){400000,100000000} Completed iterations: 50 / 50 ( x 1000 ) Matches found per iteration: 2 Elapsed Time: 0.25 s, 254.34 ms, 254339 µs Regex2: (?!a){1} Completed iterations: 50 / 50 ( x 1000 ) Matches found per iteration: 2 Elapsed Time: 0.25 s, 254.06 ms, 254059 µs` So greedy/non-greedy have nothing to do with it. –  May 07 '17 at 22:45
  • @sln `Quantifiers cannot exist in a state machine when they don't quantify anything`. Exactly. That's why in the main loop, when there is nothing to quantify over, the code throws "Dangling meta character" when you try that. For example with the regex `"+"`. Only the `{` character falls through, which seems like a mistake. When Java parses characters which it CAN quantify over, it will check for `{` and works correctly. – Pelle Wielinga May 09 '17 at 19:22
  • @sln As for the examples: I mentioned there is an optimization in place makes sure matches of 0 length are never greedy. Without this optimization `(?!a){400000,100000000}` would be significantly slower. Greedy means it would keep trying to match `(?!a)` until it reaches the 100000000. In the lazy case it can stop at 400000. Similarly, with lazy `(?!a)+?` it can stop at 1 but with greedy `(?!a)+` it would have to try up until infinity. – Pelle Wielinga May 09 '17 at 19:52
  • @PelleWielinga - It's not that zero length matches are never greedy. Repetition operators are attached to a state at it's beginning. If there, it dictates how the state is repeated. Operators have their own separate code structure based on greedy/non-greedy, that is unaware of what it operates on. There is overhead to this, but the bottom line is it won't repeat past 1 time if it can't consume like assertions. Bench: `(?!a)` => _0.14 s_ `{200}` => _0.16 s_ `{800000}` => _0.16 s_ `{800000,30000000}` => _0.16 s_ `?` => _0.18 s_ `*` => _0.17 s_ `{0}` => _0.15 s_ `*?` => _0.15 s_ `+?` => _0.16 s_ –  May 10 '17 at 19:40