I am getting "Exception in thread "main" java.lang.StackOverflowError" when using this regex:

(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")

on a long string. I want to split each line of a .csv file on the commas that are outside double quotes. It works fine for 450 columns, but with more columns it fails with the error below:

Exception in thread "main" java.lang.StackOverflowError
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4148)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
fge
  • You are probably a victim of catastrophic backtracking. See [here](http://stackoverflow.com/questions/17043454/using-regexes-how-to-efficiently-match-strings-between-double-quotes-with-embed) for a good way to match "double-quoted, \"escaped\" strings" – fge May 08 '14 at 07:21
  • Also, as this seems to be CSVs, why not use OpenCSV? – fge May 08 '14 at 07:25
  • It's unfortunate that there is not a good CSV library available (to my knowledge). This problem stems from how one should interpret a CSV file. At first, you can specify a delimiter `','` and a quotation character `'"'`. The quotation character surrounds fields, so for instance `"Hello, my", name, " is ", jared` would create the following fields: `{"Hello, my", "name", " is ", "jared"}`. You can iterate through the characters of the line to generate these fields (line-by-line, character-by-character). It becomes an issue when the quotation character needs to be part of the field! – Jared May 08 '14 at 07:34
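
The character-by-character approach described in the last comment can be sketched without any regex. This is a minimal illustration, not a full CSV parser: the class name `CsvLineSplitter` and its `split` method are hypothetical, it assumes the whole record is on one line, and it keeps the quote characters in each field, as the regex in the question does. Because it uses a plain loop, its stack use does not grow with the number of columns.

```java
import java.util.ArrayList;
import java.util.List;

class CsvLineSplitter {
    /** Splits one line on commas that are outside double quotes; quotes stay in the fields. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // Toggle the quoted state; a doubled quote ("") toggles twice
                // and therefore leaves the state unchanged.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == ',' && !inQuotes) {
                // A comma outside quotes ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field (may be empty)
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("a,\"b,c\",d"));
    }
}
```

Unlike `String.split`, this sketch keeps trailing empty fields, and it does not trim whitespace around fields or unescape doubled quotes.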

2 Answers

Use an atomic group instead of a capturing group which you don't need:

,(?=(?>[^\"]*\"[^\"]*\")*[^\"]*$)

That should speed things up and prevent unnecessary backtracking.
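
As a small sketch of how the atomic-group version could be used from Java (the class name `AtomicGroupSplit` is hypothetical), the pattern can be compiled once and reused for every line. The example only shows that the split result is unchanged on a short input; it does not demonstrate behaviour on very wide rows.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class AtomicGroupSplit {
    // Same lookahead as in the question, with the repeated group made atomic: (?>...)
    static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile(",(?=(?>[^\"]*\"[^\"]*\")*[^\"]*$)");

    /** Splits a line on commas that are followed by an even number of quotes. */
    static String[] split(String line) {
        return COMMA_OUTSIDE_QUOTES.split(line);
    }

    public static void main(String[] args) {
        // prints [a, "b,c", d]
        System.out.println(Arrays.toString(split("a,\"b,c\",d")));
    }
}
```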

Tim Pietzcker

I had the same problem when matching a long string with a regex (the line was more than 25k characters long). I fixed it by adding a plus (+) after the last quantifier in my regex, which makes that quantifier possessive.

See "Possessive quantifiers" at https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

Here is my regex after the change:

// find string using pattern `normal* (special normal*)*` 
// where special — any escaped symbol
Pattern stringRe = Pattern.compile("\"[^\\\\\"]*(\\\\.[^\\\\\"]*)*+\"");
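
A minimal sketch of that pattern in use, assuming a hypothetical helper class `PossessiveStringToken`: it finds the first double-quoted token in a text, where the token may contain backslash-escaped characters. The `*+` after the group is the possessive quantifier, so once the group has consumed its escaped sequences the engine does not go back into them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveStringToken {
    // normal* (special normal*)*+  between two double quotes
    static final Pattern STRING_RE =
            Pattern.compile("\"[^\\\\\"]*(\\\\.[^\\\\\"]*)*+\"");

    /** Returns the first double-quoted token in text, or null if there is none. */
    static String firstString(String text) {
        Matcher m = STRING_RE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The input is: say "he said \"hi\"" now
        // prints "he said \"hi\""
        System.out.println(firstString("say \"he said \\\"hi\\\"\" now"));
    }
}
```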

Here is the complete code for tokenizing JSON in which I encountered the StackOverflowError:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// \s already covers tab, newline and space
Pattern spaceRe = Pattern.compile("\\s+");
// THIS PATTERN IS SAFE FOR A LONG STRING
// find string using pattern `normal* (special normal*)*`
// where special is any escaped symbol
Pattern stringRe = Pattern.compile("\"[^\\\\\"]*(\\\\.[^\\\\\"]*)*+\"");
// the decimal point is escaped so that it matches a literal '.'
Pattern numberRe = Pattern.compile("\\d+(?:\\.\\d+)?");
Pattern boolRe = Pattern.compile("true|false");
Pattern nilRe = Pattern.compile("null");
Pattern openCurlyBraceRe = Pattern.compile("\\{");
Pattern closeCurlyBraceRe = Pattern.compile("\\}");
Pattern openSquareBracketRe = Pattern.compile("\\[");
Pattern closeSquareBracketRe = Pattern.compile("\\]");
Pattern colonRe = Pattern.compile(":");
Pattern commaRe = Pattern.compile(",");
// one alternation that matches any single token
Pattern re = Pattern.compile("(?:" + spaceRe.pattern() +
        "|" + stringRe.pattern() +
        "|" + numberRe.pattern() +
        "|" + boolRe.pattern() +
        "|" + nilRe.pattern() +
        "|" + openCurlyBraceRe.pattern() +
        "|" + closeCurlyBraceRe.pattern() +
        "|" + openSquareBracketRe.pattern() +
        "|" + closeSquareBracketRe.pattern() +
        "|" + colonRe.pattern() +
        "|" + commaRe.pattern() + ")"
);
int i = 0;
// json is the input String to tokenize
Matcher matcher = re.matcher(json);
while (matcher.find()) {
   // some stuff for parsing
}

Some additional information, quoted from https://regular-expressions.mobi/possessive.html?wlr=1

When Possessive Quantifiers Matter

The main practical benefit of possessive quantifiers is to speed up your regular expression. In particular, possessive quantifiers allow your regex to fail faster. In the above example, when the closing quote fails to match, we know the regular expression couldn’t possibly have skipped over a quote. So there’s no need to backtrack and check for the quote. We make the regex engine aware of this by making the quantifier possessive. In fact, some engines, including the JGsoft engine, detect that [^"]* and " are mutually exclusive when compiling your regular expression, and automatically make the star possessive.

Joter