48

This is my Regex

((?:(?:'[^']*')|[^;])*)[;]

It tokenizes a string on semicolons. For example,

Hello world; I am having a problem; using regex;

Result is three strings

Hello world
I am having a problem
using regex

But when I use a large input string I get this error

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)

How is this caused and how can I solve it?

BalusC
  • 1,082,665
  • 372
  • 3,610
  • 3,555
Ali
  • 7,810
  • 12
  • 42
  • 65

3 Answers3

64

Unfortunately, Java's builtin regex support has problems with regexes containing repetitive alternative paths (that is, (A|B)*). This is compiled into a recursive call, which results in a StackOverflow error when used on a very large string.

A possible solution is to rewrite your regex to not use a repititive alternative, but if your goal is to tokenize a string on semicolons, you don't need a complex regex at all really, just use String.split() with a simple ";" as the argument.

Jeen Broekstra
  • 21,642
  • 4
  • 51
  • 73
  • 2
    Most regex engines will have this problem. – NullUserException Sep 22 '11 at 05:43
  • I think I didn't put up my case clearly, sorry for that. String is not tokenized just on semicolon but it will be tokenized on many patterns at the same time, tokenizing on semicolon was just one simple case. – Ali Sep 22 '11 at 05:46
  • 1
    @Ali well, generally speaking: try to avoid alternatives in a wildcard. You could also give alternative regex libraries such as jregex a go, though I'm not sure that would solve the issue... – Jeen Broekstra Sep 22 '11 at 06:03
  • 2
    @NullUserException Not JavaScript at least. I'm doing the same on a string with 1600+ chars. It gives me StackOverflowError in Java. And yet JavaScript gives me the correct result within 1 ms. Tbh I'm really appalled by how a compiled language, with all its possible optimization, could perform worse, in terms of performance, than an interpreted language in this simple task. You could try this yourself (browser console in F12): `function test(str){console.time("test");console.log(/^([^']|'[^']*')*'[^']*$/.test(str));console.timeEnd("test")}` then `test("1000+ char string")` – zypA13510 Jan 26 '21 at 08:53
24

If you really need to use a regex that overflows your stack, you can increase the size of your stack by passing something like -Xss40m to the JVM.

Andrew
  • 897
  • 1
  • 9
  • 19
  • 1
    Increasing stack size via Xss property will increase possibility of OutOfMemoryError. Though it will fix current issue. Rather than increasing Xss refer top answer. – 100rabh Oct 18 '18 at 12:12
  • Helped me in solving a stack overflow error i am seeing while launching the eclipse . The error i was seeing was Could not create content describer for org.robotframework.red.robotsuitefile_robot. Content type has been disabled. java.lang.StackOverflowError – ChandraSPola Jan 17 '21 at 04:59
9

It might help to add a + after the [^;], so that you have fewer repetitions.

Isn't there also some construct that says “if the regular expression matched up to this point, don't backtrace”? Maybe that comes in handy, too. (Update: it is called possessive quantifiers).

A completely different alternative is to write a utility method called splitQuoted(char quote, char separator, CharSequence s) that explicitly iterates over the string and remembers whether it has seen an odd number of quotes. In that method you could also handle the case that the quote character might need to be unescaped when it appears in a quoted string.

'I'm what I am', said the fox; and he disappeared.
'I\'m what I am', said the fox; and he disappeared.
'I''m what I am', said the fox; and he disappeared.
Roland Illig
  • 40,703
  • 10
  • 88
  • 121