StackOverflowError while using String.matches in Java

Question

For a small university project, I need to extract code samples from HTML given as a string. To be more precise, I need to get everything between <code> and </code> from that HTML string.

I'm writing in Java and using String.matches to do that.

My code:

public static ArrayList<String> extractByHTMLtagDelimiters(String source, String startDelimiter, String endDelimiter) {
    ArrayList<String> results = new ArrayList<String>();
    if (source.matches("([\t\n\r]|.)*" + startDelimiter + "([\t\n\r]|.)*" + endDelimiter)) {
        //source has some code samples in it
        //get array entries of the form: {Some code}</startDelimiter>{something else}
        String[] splittedSource = source.split(startDelimiter);
        for (String sourceMatch : splittedSource) {
            if (sourceMatch.matches("([\t\n\r]|.)*" + endDelimiter + "([\t\n\r]|.)*")) {
                //current string has code sample in it (with some body leftovers)
                //the code sample located before the endDelimiter - extract it
                String codeSample = (sourceMatch.split(endDelimiter))[0];
                //add the code samples to results
                results.add(codeSample);
            }
        }
    }
    return results;
}

I tried to extract the samples from some HTML of ~1300 characters and got a pretty massive exception (it goes on like this for a few dozen lines):

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)

I've found the following bug report: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5050507

Is there anything I can do to still use String.matches? If not, can you please recommend some other way to do it without implementing HTML parsing myself?

Thanks a lot, Dub.

See [What HTML parsing libraries do you recommend in Java](http://stackoverflow.com/questions/26638/what-html-parsing-libraries-do-you-recommend-in-java). — Matthew Flaschen, Apr 01 '11 at 20:08
@khachik, if you bothered to look at the bug, you would realize it was closed as "Will not fix", as it's pretty fundamental to the way the regex library was written. So upgrading won't make any difference. — Matthew Flaschen, Apr 01 '11 at 20:09
I'm using the newest Java (I think; I updated a few months ago). I just mentioned that I came across this problem on the web, and it looks like it still exists in my Java version. — Boris C, Apr 01 '11 at 20:11
Don't use regex to parse html [:)](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — MAK, Apr 01 '11 at 20:19
obligatory link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Don Roby, Apr 01 '11 at 20:21
@MAK, note that the HTML is given as a string; everything I could find on the web parsed HTML from a URL. I'll check @Matthew's link in a sec. Hope the salvation is there. — Boris C, Apr 01 '11 at 20:23
Rather than alternating . with \n and \r (why add \t?) why not set the DOTALL flag in the pattern? — Neil, Apr 01 '11 at 20:31
@Dub, the better-written libraries can certainly parse Strings. For example, with JTidy you can pass a [`StringReader`](http://download.oracle.com/javase/6/docs/api/java/io/StringReader.html) to [`Tidy.parse`](http://jtidy.sourceforge.net/apidocs/org/w3c/tidy/Tidy.html#parse%28java.io.Reader,%20java.io.Writer%29). — Matthew Flaschen, Apr 01 '11 at 20:38

score 3 · Accepted Answer · answered Apr 01 '11 at 20:35

You can just go through the input string manually, using String's indexOf() method to find the start and end delimiters, and extract the text between them yourself.

public static void main(String[] args) {
    String source = "<html>blah<code>this is awesome</code>more junk</html>";

    String startDelim = "<code>";
    String endDelim = "</code>";
    int start = source.indexOf(startDelim);
    int end = source.indexOf(endDelim);

    String code = source.substring(start + startDelim.length(), end);
    System.out.println(code);
}

If you need to find more than one, just call indexOf again, starting from the point where you finished:

int nextStart = source.indexOf(startDelim, end + endDelim.length());
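Putting the two steps together, a loop over indexOf can collect every sample. The following is only a minimal sketch of that idea: the class name CodeExtractor and the method name extractAll are made up for illustration, and it assumes that an opening delimiter without a matching closing delimiter should simply end the search.

```java
import java.util.ArrayList;
import java.util.List;

public class CodeExtractor {

    // Hypothetical helper: returns every substring found between
    // startDelim and endDelim, in order of appearance.
    public static List<String> extractAll(String source, String startDelim, String endDelim) {
        List<String> results = new ArrayList<>();
        int start = source.indexOf(startDelim);
        while (start != -1) {
            int contentStart = start + startDelim.length();
            int end = source.indexOf(endDelim, contentStart);
            if (end == -1) {
                // Assumption: an unterminated start delimiter ends the search.
                break;
            }
            results.add(source.substring(contentStart, end));
            // Resume the search just past the end delimiter we consumed.
            start = source.indexOf(startDelim, end + endDelim.length());
        }
        return results;
    }

    public static void main(String[] args) {
        String source = "<html>a<code>first</code>b<code>second\nline</code>c</html>";
        for (String sample : extractAll(source, "<code>", "</code>")) {
            System.out.println(sample);
        }
    }
}
```

Because indexOf compares the delimiters literally and involves no regex, this approach is not affected by the recursion limit that caused the StackOverflowError.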

Thanks, it did the job! Somehow I always forget that the simplest solution might be the best. — Boris C, Apr 01 '11 at 20:52

score 1 · Answer 2 · answered Apr 01 '11 at 20:45

Simply replace each "([\t\n\r]|.)*" in your regex pattern with "(?s).*"

The (?s) flag turns on DOTALL mode, so . matches anything, including newlines, as you intended.
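If you want to stay with regex, a variation on this idea is to compile one pattern with the DOTALL flag and pull out the samples with Matcher.find(). This is only a sketch under some assumptions: the class name RegexCodeExtractor and the method name extractAll are invented here, the delimiters are passed through Pattern.quote so they are treated as literal text, and the reluctant quantifier .*? is used so that each match stops at the first closing delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCodeExtractor {

    // Hypothetical helper: returns the text captured between each
    // startDelim/endDelim pair, using a single DOTALL pattern.
    public static List<String> extractAll(String source, String startDelim, String endDelim) {
        Pattern pattern = Pattern.compile(
                // Pattern.quote makes the delimiters match literally.
                Pattern.quote(startDelim) + "(.*?)" + Pattern.quote(endDelim),
                // DOTALL lets '.' match line terminators too, same as (?s).
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(source);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            // Group 1 is the text between the delimiters.
            results.add(matcher.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        String source = "<html>a<code>first</code>b<code>second\nline</code>c</html>";
        for (String sample : extractAll(source, "<code>", "</code>")) {
            System.out.println(sample);
        }
    }
}
```

Unlike the original pattern, this one has no alternation inside a repeated group, which is the shape that shows up in the stack trace above.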

Simon G.

Personally, I prefer the non-regex solution from wolfcastle. – Simon G. Apr 01 '11 at 20:47
