It looks like you're trying to capture the contents of the RHS string while also enforcing that it is preceded by "body":" and followed by ".
You seem to be using lookaround assertions to test for the presence of the surrounding text, while also using a capture group to capture the contents of the RHS string. You don't need to do both. Lookaround assertions are zero-width, meaning they do not form part of the final matched substring, and the final matched substring is always accessible as capture group 0. Alternatively, you could match all regex components in full (non-zero-width matching, so no lookarounds) and use a capture group to extract the substring of interest, but I would expect that to be somewhat less efficient.
Here's how I think this should be written (matching against args[0] for this demo):
Pattern p = Pattern.compile("(?<=\"body\":\")[^\"]*(?=\")");
Matcher m = p.matcher(args[0]);
if (!m.find(0)) { System.out.println("doesn't match!"); System.exit(1); }
System.out.println(m.group(0));
The above works for me with fairly large strings.
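For comparison, here is a minimal sketch of the two approaches described above, side by side: lookarounds with group 0, and a full match with capture group 1. The class name, method names and sample input are made up for illustration; only the lookaround pattern comes from the snippet above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class LookaroundVsGroup {
    // Lookarounds are zero-width, so group(0) holds only the string contents.
    static final Pattern LOOKAROUND =
        Pattern.compile("(?<=\"body\":\")[^\"]*(?=\")");

    // No lookarounds: the delimiters are consumed by the match,
    // so the contents have to be pulled out of capture group 1.
    static final Pattern CAPTURE =
        Pattern.compile("\"body\":\"([^\"]*)\"");

    static String viaLookaround(String input) {
        Matcher m = LOOKAROUND.matcher(input);
        return m.find() ? m.group(0) : null;
    }

    static String viaCapture(String input) {
        Matcher m = CAPTURE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed sample input, not taken from the question.
        String input = "{\"id\":1,\"body\":\"hello world\"}";
        System.out.println(viaLookaround(input));
        System.out.println(viaCapture(input));
    }
}
```

Both methods should return the same contents for the same input; the difference is only in which group you read and whether the delimiters are part of the overall match.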
I did try to reproduce the StackOverflowError, and I succeeded. It looks to me like the Java regex engine uses recursion to implement matching of repeated alternations. This is very surprising to me, since I don't see why recursion would be necessary at all to match repeated alternations. That being said, I also ran a small test with Perl regular expressions, which I've always considered the most powerful and robust regular expression flavor in existence, and discovered to my further surprise that Perl fails in exactly the same way as Java.
Below is an example of this, showing both Java failing and Perl failing. For this demo I changed the [^"] atom to the alternation (?:\\.|[^"]), which effectively adds support for backslash escape codes embedded in the double-quoted string, such as \" to encode an embedded double-quote, which is commonly supported in many programming environments.
Java
Pattern p = Pattern.compile("(?<=\"body\":\")(?:\\\\.|[^\"])*(?=\")");
Matcher m = p.matcher(args[0]);
if (!m.find(0)) { System.out.println("doesn't match!"); System.exit(1); }
System.out.println(m.group(0));
Output
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
...
Perl (from the shell)
largeString="\"body\":\"$(perl -e 'use strict; use warnings; for (my $i = 0; $i < 2**15; ++$i) { print("x"); }';)\"";
perl -e 'use strict; use warnings; my $re = qr((?<="body":")(?:\\.|[^"])*(?=")); if ($ARGV[0] !~ $re) { print("didn'\''t match!\n"); } print($&);' "$largeString";
Output
Complex regular subexpression recursion limit (32766) exceeded at -e line 1.
didn't match!
Use of uninitialized value $& in print at -e line 1.
So, just to clarify, the reason why my solution given near the start of my answer avoids this stack overflow error is not because I removed capture group 1, but rather because I removed the alternation. Again, I don't know why repeated alternations are implemented with recursion, but given that fact, it seems only logical that a large input string would lead to a stack overflow error.
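If you do need backslash escapes, one way to apply the same idea is to rewrite the repeated alternation so that runs of ordinary characters are consumed by a plain character class, and only escape sequences go through a repeated group. This "unrolled" pattern is my own sketch, not something from the question, and it is not strictly equivalent to (?:\\.|[^"])* on malformed input (for example a lone trailing backslash). I'm also assuming, based on the behaviour observed above, that a repeated character class does not recurse per character; a string with a very large number of escapes would still iterate the group that many times.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class UnrolledBody {
    // Unrolled form of (?:\\.|[^"])* :
    //   [^"\\]*            a run of ordinary characters
    //   (?:\\.[^"\\]*)*    then zero or more (escape pair + ordinary run)
    static final Pattern UNROLLED = Pattern.compile(
        "(?<=\"body\":\")[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*(?=\")");

    static String extract(String input) {
        Matcher m = UNROLLED.matcher(input);
        return m.find() ? m.group(0) : null;
    }

    // Builds a string of n 'x' characters (helper for the demo only).
    static String repeatX(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) sb.append('x');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same shape as the large test string used in the Perl demo above.
        String big = "\"body\":\"" + repeatX(1 << 15) + "\"";
        System.out.println(extract(big).length());

        // A body containing an escaped double quote.
        System.out.println(extract("\"body\":\"a\\\"b\""));
    }
}
```

With no escapes in the input, the group is never entered, so the match reduces to the same character-class loop as the solution near the start of this answer.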