Stack Overflow in java regex

Question

I am new to Java. I am getting a java.lang.StackOverflowError when I run the regex replacements below on strHindiText. What should I do about it?

try {
     // This regex converts the pattern "{\fldrslt {\fcs1 \ab\af24 \fcs0 &#2345;}{"
     // into "{\fldrslt {\fcs1 \ab\af24 \fcs0 &#2345;}}}{"
     // strHindiText = strHindiText.replaceAll("\\{(\\\\fldrslt[ ])\\{((\\\\\\S+[ ])+)((\\s*&#\\d+;\\s*(-|,|/|\\(|\\)|\"|;|\\.|'|<|>|:|\\?)*)+)\\}\\{","{$1{$2$4}}}{");

     // This regex converts the pattern "{\fcs0 \af0 &#2345;{ or {\fcs0 \af0 *\tab &#2345;{"
     // into "{\fcs0 \af0 &#2345; }{"
     strHindiText = strHindiText.replaceAll("\\{\\s*((\\\\\\S+[ ](\\*)?)+\\s*)(-|,|/|\\(|\\)|\"|;|\\.|'|<|>|:|\\?)*[ ]*(((&#\\d+;)[ ]*(-|,|/|\\(|\\)|\"|;|\\.|'|<|>|:|\\?)*[ ]*)+)\\{", "{$1 $4$5 }{");

     // This regex converts the pattern "{&#2345; \fcs0 \af0 {"
     // into "{&#2345; \fcs0 \af0 }{"
     strHindiText = strHindiText.replaceAll("\\{\\s*(((&#\\d+;)[ ]*(-|,|/|\\(|\\)|\"|;|\\.|'|<|>|:|\\?)*[ ]*)+)[ ]*((\\\\\\S+[ ])+)\\{", "{$1 $5 }{");

} catch (StackOverflowError er) {
     System.out.println("Third try Block StackOverflowError in regex pattern to reform the rtf tags................");
     er.printStackTrace();
     // throw er;
}

Whenever strHindiText contains a large amount of data, it throws a java.lang.StackOverflowError:

java.lang.StackOverflowError
2013-08-08 15:35:07,743 ERROR [STDERR] (http-127.0.0.1-80-9)    at java.util.regex.Pattern$Curly.match0(Pattern.java:3754)
2013-08-08 15:35:07,743 ERROR [STDERR] (http-127.0.0.1-80-9)    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)
2013-08-08 15:35:07,744 ERROR [STDERR] (http-127.0.0.1-80-9)    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
2013-08-08 15:35:07,744 ERROR [STDERR] (http-127.0.0.1-80-9)    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3366)
2013-08-08 15:35:07,745 ERROR [STDERR] (http-127.0.0.1-80-9)    at java.util.regex.Pattern$Curly.match0(Pattern.java:3782)
2013-08-08 15:35:07,745 ERROR [STDERR] (http-127.0.0.1-80-9)    at java.util.regex.Pattern$Curly.match(Pattern.java:3744)

My strHindiText data is:

 `{\rtlch\fcs1 \af1\afs18 \ltrch\fcs0 \f1\fs18\cf21\insrsid13505584 &#2349;&#2379;&#2346;&#2366;&#2354;&#32; &#2404; \par }\pard\plain \ltrpar\s16\ql \li0\ri0\sb100\sa100\sbauto1\saauto1\sl240\slmult0\widctlpar\wrapdefault\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0\pararsid13505584 \cbpat20 \rtlch\fcs1 \af0\afs24\alang1025 \ltrch\fcs0 \fs24\lang1033\langfe1033\cgrid\langnp1033\langfenp1033 {\rtlch\fcs1 \ab\af1\afs18 \ltrch\fcs0 \cs21\b\f1\fs18\cf21\insrsid13505584 &#2309;&#2344;&#2381;&#2357;&#2375;&#2359;&#2339;&#32;&#2325;&#2352;&#2375;&#2306;&#32; :}{\rtlch\fcs1 \af1\afs18 \ltrch\fcs0 \f1\fs18\cf21\insrsid13505584  \par &#2349;&#2379;&#2346;&#2366;&#2354;&#32;&#44;&#32;&#2350;&#2343;&#2381;&#2351;&#32;&#2346;&#2381;&#2352;&#2342;&#2375;&#2358;&#32;&#2325;&#2368;&#32;&#2352;&#2366;&#2332;&#2343;&#2366;&#2344;&#2368;&#32;&#2346;&#2381;&#2352;&#2366;&#2325;&#2371;&#2340;&#2367;&#2325;&#32;&#2360;&#2369;&#2306;&#2342`

Probably the JVM runs out of stack space for big data. Try to adjust it: http://stackoverflow.com/questions/3700459/how-to-increase-to-java-stack-size — PeterMmm, Aug 08 '13 at 10:01
Your alternative paths `|` are probably causing recursive calls, resulting in the stackoverflow. Regex stuff is complicated in general, and your regex is big. I'm not surprised. — keyser, Aug 08 '13 at 10:02
I have posted the full exception that I am getting. Please help me; I have been stuck for two days. — Aditya, Aug 08 '13 at 10:27
I would recommend running some tests with simpler regex patterns to see what is matched and how deep the grouping goes on this text. — Rene M., Aug 08 '13 at 10:29
I would suggest instead of alternatives (e.g `a|b|c`) to use the alternative notation: `[abc]`, this should make the regex clearer, and you just need to escape the closing bracket and no other character. Also, it looks like you want to do something that regexes aren't good for - parsing - for something that isn't text but has a higher ordering. — Tassos Bassoukos, Aug 08 '13 at 10:33
You really shouldn't use `RegEx` for such enormous parsings.. it's not very performant, since the regex expression compiles every time you try to match a string. — Georgian, Aug 08 '13 at 11:13
From my experiences with it, the Java Pattern API is not very performant, and even a bit unstable (it crashed on some regexp that PHP/JS/Perl could perfectly handle). Consider using something else, maybe JFlex. — Giulio Franco, Aug 08 '13 at 11:29
Everything about your code is _asking_ for problems. Try breaking the problem into multiple small problems rather than trying to do a bazillion things all at once with a giant regex. Based on the regexes you're using, I'd be surprised if you _didn't_ experience memory problems. — jahroy, Aug 08 '13 at 19:32
I would personally recommend writing a parser for your RTF rather than attempting to cut it up with regex. Regex is meant for simple things, and I don't imagine RTF in Hindi is simple at all. — Shaz, Aug 08 '13 at 20:34
Here are a couple of links that describe why/how regex is _not_ the correct tool for parsing RTF documents: [one](http://stackoverflow.com/a/188877/778118), [two](http://regexadvice.com/forums/permalink/87397/87402/ShowThread.aspx#87402). Notice that both people understand the RTF spec well and advise against the use of regular expressions when parsing RTF documents. — jahroy, Aug 09 '13 at 04:04
I would suggest breaking down your string: take the first few characters and then keep adding more to localize the problem. I am also curious about your purpose; there might be better solutions if you explain your problem statement. — Sachin Thapa, Aug 16 '13 at 05:08
Assuming that this didn't give you memory problems, How in the world would you debug that regex in the event that it doesn't work as expected? — Cruncher, Oct 25 '13 at 12:47

score 3 · Answer 1 · edited May 23 '17 at 12:10

Option 1 - Treat the symptoms

Look for recursive calls in your regex.

If you are not sure where your problem lies: try a regex tester like this.
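Another way to treat the symptom, along the lines of the stack-size comment on the question, is to run the replacement on a thread that asks for a larger stack. The following is only a sketch: `BigStackRegexSketch`, `replaceAllWithBigStack` and the 64 MB figure are made-up names and values for illustration, and the JVM is allowed to treat the requested stack size as a hint. It does not make the regex any cheaper; it only postpones the point at which deep recursion fails.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper for illustration; not taken from the question's code base.
class BigStackRegexSketch {

    /** Runs String.replaceAll on a dedicated thread that requests stackBytes of stack. */
    static String replaceAllWithBigStack(String input, String regex,
                                         String replacement, long stackBytes) {
        AtomicReference<String> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();

        // Thread(ThreadGroup, Runnable, String, long): the last argument is the
        // requested stack size in bytes (a hint that some platforms may ignore).
        Thread worker = new Thread(null, () -> {
            try {
                result.set(input.replaceAll(regex, replacement));
            } catch (Throwable t) {
                // Includes StackOverflowError if the larger stack is still too small.
                failure.set(t);
            }
        }, "regex-worker", stackBytes);

        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for regex", e);
        }
        if (failure.get() != null) {
            throw new IllegalStateException("regex replacement failed", failure.get());
        }
        return result.get();
    }

    public static void main(String[] args) {
        // Tiny stand-in pattern; the real patterns from the question would go here.
        String out = replaceAllWithBigStack("{a{", "\\{(a)\\{", "{$1}{", 64L * 1024 * 1024);
        System.out.println(out);
    }
}
```

The same effect for the whole JVM comes from the `-Xss` launch option, but a dedicated thread keeps the larger stack limited to the one place that needs it.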

Option 2 - Treat the cause (much better)

Don't use a regex if there are better tools for your task.

In your case you could search for an RTF parsing library or write your own parser, e.g. like the one here that jahroy pointed out in the comments.
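To give an idea of what "write your own parser" could start from, here is a minimal tokenizer sketch. It is an assumption-laden illustration, not a complete RTF reader: `RtfTokenizerSketch` and `tokenize` are hypothetical names, and the control-word rule (backslash, ASCII letters, optional numeric parameter, one optional delimiter space) is a simplification. It runs in a single loop with no recursion, so input size does not affect stack depth.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any RTF library.
class RtfTokenizerSketch {

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Splits RTF-like text into "{", "}", control words/symbols, and plain text runs. */
    static List<String> tokenize(String rtf) {
        List<String> tokens = new ArrayList<>();
        int n = rtf.length();
        int i = 0;
        while (i < n) {
            char c = rtf.charAt(i);
            if (c == '{' || c == '}') {
                // Group delimiters become single-character tokens.
                tokens.add(String.valueOf(c));
                i++;
            } else if (c == '\\') {
                int start = i++;
                if (i < n && isAsciiLetter(rtf.charAt(i))) {
                    // Control word: letters, then an optional (possibly negative) number.
                    while (i < n && isAsciiLetter(rtf.charAt(i))) i++;
                    if (i + 1 < n && rtf.charAt(i) == '-' && isAsciiDigit(rtf.charAt(i + 1))) i++;
                    while (i < n && isAsciiDigit(rtf.charAt(i))) i++;
                    tokens.add(rtf.substring(start, i));
                    // A single space after a control word is a delimiter, not text.
                    if (i < n && rtf.charAt(i) == ' ') i++;
                } else {
                    // Control symbol such as \* : backslash plus one character.
                    if (i < n) i++;
                    tokens.add(rtf.substring(start, i));
                }
            } else {
                // Plain text run up to the next brace or backslash.
                int start = i;
                while (i < n && "{}\\".indexOf(rtf.charAt(i)) < 0) i++;
                tokens.add(rtf.substring(start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // One of the patterns from the question's comments.
        System.out.println(tokenize("{\\fcs0 \\af0 &#2345;{"));
    }
}
```

With tokens like these, the rewrites from the question (for example inserting a missing `}` before a `{` that follows a text run) become small checks on a token list instead of one large pattern over the whole document.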

score 1 · Answer 2 · answered Nov 27 '13 at 11:33

This is not a full answer but just for your information.

In your regex:

(-|,|/|\\(|\\)|\"|;|\\.|'|<|>|:|\\?)* can be written as [-,/()\";.'<>:?]*

Since this pattern occurs twice (in your first regex), this immediately shortens your regex by 40 characters and makes those sections much more readable.
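A small check that the two spellings accept the same strings, written as Java string literals. `PunctuationClassSketch` and `sameVerdict` are hypothetical names used only for this sketch. One caveat that matters for the question's code: the alternation is a capturing group, and the second regex refers to it as `$4` in the replacement string, so dropping the parentheses shifts the group numbers and the replacement has to be adjusted (or the class wrapped in its own group).

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
class PunctuationClassSketch {

    // The alternation exactly as it appears in the question's regexes.
    static final Pattern ALTERNATION =
            Pattern.compile("(-|,|/|\\(|\\)|\"|;|\\.|'|<|>|:|\\?)*");

    // The character-class form; a '-' placed first inside [...] is a literal hyphen,
    // and ( ) . ? need no escaping inside a character class.
    static final Pattern CHAR_CLASS =
            Pattern.compile("[-,/()\";.'<>:?]*");

    /** True when both patterns agree on whether the whole string matches. */
    static boolean sameVerdict(String s) {
        return ALTERNATION.matcher(s).matches() == CHAR_CLASS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"", "-,/", "(\"?)", ":;.'<>", "abc", "-a"};
        for (String s : samples) {
            System.out.println("'" + s + "' matches: " + CHAR_CLASS.matcher(s).matches()
                    + ", same as alternation: " + sameVerdict(s));
        }
    }
}
```

The character class is matched as a single per-character test rather than as a group with branches. Whether that alone removes the overflow on large input is not something this sketch demonstrates.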

score 0 · Answer 3 · edited Aug 09 '13 at 02:48


Try this to catch the error

public class Example {
    public static void endless() {
        endless();
    }

    public static void main(String args[]) {
        try {
            endless();
        } catch(StackOverflowError t) {
            // more general: catch(Error t)
            // anything: catch(Throwable t)
            System.out.println("Caught "+t);
            t.printStackTrace();
        }
        System.out.println("After the error...");
    }
}

More importantly, try increasing the size of the stack. You can also add this to your regex:

+'xss='xss

Adding the "+" symbol after a quantifier makes it possessive, which prevents backtracking; backtracking doesn't seem to be necessary in your case.
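For reference, this is what a trailing "+" on a quantifier does in `java.util.regex`. The sketch below only shows the semantics; `PossessiveQuantifierSketch` is a hypothetical name, and `CONTROL_WORDS` is an assumed possessive rewrite of the `(\\\S+[ ])+` fragment from the question, not a tested replacement for it. A possessive quantifier changes what can match, so it is only safe where the rest of the pattern never needs characters back from the loop, and this sketch does not show whether it avoids the StackOverflowError on the asker's data.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
class PossessiveQuantifierSketch {

    // Greedy: a* first takes every 'a', then backtracks one step so the final 'a' can match.
    static final Pattern GREEDY = Pattern.compile("a*a");

    // Possessive: a*+ takes every 'a' and never gives one back, so the final 'a' cannot match.
    static final Pattern POSSESSIVE = Pattern.compile("a*+a");

    // Assumed possessive form of the question's run of control words, e.g. "\fcs0 \af0 ".
    static final Pattern CONTROL_WORDS = Pattern.compile("(?:\\\\\\S+[ ])++");

    public static void main(String[] args) {
        System.out.println("greedy a*a on aaa:      " + GREEDY.matcher("aaa").matches());
        System.out.println("possessive a*+a on aaa: " + POSSESSIVE.matcher("aaa").matches());
        System.out.println("control words:          "
                + CONTROL_WORDS.matcher("\\fcs0 \\af0 ").matches());
    }
}
```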

answered Aug 09 '13 at 02:45 by Bmize729 · edited Aug 09 '13 at 02:48 by jh314

He should consider using the right tool for the job rather than treating the symptoms that result from using the wrong tool... – jahroy Aug 09 '13 at 02:52
Chances are the overflow is coming from recursion issues, not from greediness in the regex. By making the operator possessive we can eliminate branching and recursive handling, which makes the expression more efficient and lets it use less memory. – Bmize729 Aug 09 '13 at 03:07
Just out of curiosity what specifically would you recommend here? – Bmize729 Aug 09 '13 at 03:13
I would either look for an RTF parsing library or write one myself. If I wrote one myself I would break up the parsing into small tasks rather than try to do everything at once. If I **had** to use regexes, I would keep them small and simple and make sure they only operate on small pieces of text. I would never consider feeding the entire document to a single, complicated regex. – jahroy Aug 09 '13 at 03:18
It took about 5 seconds of googling to find [this](http://tika.apache.org/1.2/api/org/apache/tika/parser/rtf/RTFParser.html) (maybe it will help, maybe it won't...) – jahroy Aug 09 '13 at 03:24
Fair enough, I am not a big fan of regex. I am just trying to help Aditya with their question. – Bmize729 Aug 09 '13 at 03:37
Ok. Sorry if my comments were overly harsh. This whole "_I must use regex_" mentality is just so common on this site that it sometimes makes you want to scream from the top of the mountain: "_not all problems must be solved with regex!_" – jahroy Aug 09 '13 at 03:47
I can understand that. Regexes are often overused, but it seems to be more of a common education issue. People forget about parsing because it is not beaten into their heads the way regular expressions often are. – Bmize729 Aug 09 '13 at 03:49
