For a small university project I'm doing, I need to extract code samples from HTML given as a string. To be more precise, I need to get everything between <code> and </code> from that HTML string.

I'm writing in Java and using String.matches to do that.

My code:

public static ArrayList<String> extractByHTMLtagDelimiters(String source, String startDelimiter, String endDelimiter) {
    ArrayList<String> results = new ArrayList<String>();
    if (source.matches("([\t\n\r]|.)*" + startDelimiter + "([\t\n\r]|.)*" + endDelimiter)) {
        // source has some code samples in it
        // split into entries of the form: {some code}<endDelimiter>{something else}
        String[] splittedSource = source.split(startDelimiter);
        for (String sourceMatch : splittedSource) {
            if (sourceMatch.matches("([\t\n\r]|.)*" + endDelimiter + "([\t\n\r]|.)*")) {
                // current string has a code sample in it (with some body leftovers)
                // the code sample is located before the endDelimiter - extract it
                String codeSample = (sourceMatch.split(endDelimiter))[0];
                // add the code sample to the results
                results.add(codeSample);
            }
        }
    }
    return results;
}

I tried to extract the samples from an HTML string of ~1300 characters and got a pretty massive exception (it goes on like this for a few dozen lines):

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)

I've found the following bug report: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5050507

Is there anything I can do to keep using String.matches? If not, can you please recommend another way to do this without implementing HTML parsing myself?

Thanks a lot, Dub.

matt b
Boris C
  • See [What HTML parsing libraries do you recommend in Java](http://stackoverflow.com/questions/26638/what-html-parsing-libraries-do-you-recommend-in-java). – Matthew Flaschen Apr 01 '11 at 20:08
  • @khachik, if you bothered to look at the bug, you would realize it was closed as "Will not fix", as it's pretty fundamental to the way the regex library was written. So upgrading won't make any difference. – Matthew Flaschen Apr 01 '11 at 20:09
  • @Matthew: you are right. – khachik Apr 01 '11 at 20:10
  • I'm using the newest Java (I think; I updated a few months ago). I just mentioned that I found this problem on the web, and it looks like it still exists in my Java version. – Boris C Apr 01 '11 at 20:11
  • Don't use regex to parse HTML [:)](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – MAK Apr 01 '11 at 20:19
  • Obligatory link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Don Roby Apr 01 '11 at 20:21
  • @MAK, note that the HTML is given as a string; everything I could find on the web parsed HTML from a URL. I'll check @Matthew's link in a sec. Hope the salvation is there. – Boris C Apr 01 '11 at 20:23
  • Rather than alternating . with \n and \r (why add \t?), why not set the DOTALL flag in the pattern? – Neil Apr 01 '11 at 20:31
  • @Dub, the better-written libraries can certainly parse Strings. For example, with JTidy you can pass a [`StringReader`](http://download.oracle.com/javase/6/docs/api/java/io/StringReader.html) to [`Tidy.parse`](http://jtidy.sourceforge.net/apidocs/org/w3c/tidy/Tidy.html#parse%28java.io.Reader,%20java.io.Writer%29). – Matthew Flaschen Apr 01 '11 at 20:38

2 Answers

You can just manually go through the input string using String's indexOf() method to find the start and end delimiters and extract the bits between them yourself.

public static void main(String[] args) {
    String source = "<html>blah<code>this is awesome</code>more junk</html>";

    String startDelim = "<code>";
    String endDelim = "</code>";
    int start = source.indexOf(startDelim);
    int end = source.indexOf(endDelim);

    String code = source.substring(start + startDelim.length(), end);
    System.out.println(code);
}

If you need to find more than one, then just use indexOf again starting at the point you finished:

int nextStart = source.indexOf(startDelim, end + endDelim.length());
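
Putting the two steps together, the repeated search might look like the sketch below. It is only an illustration: the class name `IndexOfExtractor` and the method name `extractAll` are made up for this example, and it assumes the delimiters are plain strings that do not nest. It stops as soon as an opening delimiter has no matching closing one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named only for this example.
class IndexOfExtractor {

    // Collects every substring found between startDelim and endDelim, in order.
    static List<String> extractAll(String source, String startDelim, String endDelim) {
        List<String> results = new ArrayList<>();
        int searchFrom = 0;
        while (true) {
            int start = source.indexOf(startDelim, searchFrom);
            if (start < 0) {
                break; // no more opening delimiters
            }
            int contentStart = start + startDelim.length();
            // Look for the closing delimiter only after the opening one.
            int end = source.indexOf(endDelim, contentStart);
            if (end < 0) {
                break; // opening delimiter without a closing one: stop
            }
            results.add(source.substring(contentStart, end));
            // Continue searching right after the closing delimiter.
            searchFrom = end + endDelim.length();
        }
        return results;
    }

    public static void main(String[] args) {
        String source = "<html>blah<code>first</code>junk<code>second</code></html>";
        System.out.println(extractAll(source, "<code>", "</code>"));
    }
}
```

Because this never touches the regex engine, the length of the input does not affect the stack depth.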
wolfcastle
Simply replace each "([\t\n\r]|.)*" in your regex pattern with "(?s).*"

This matches anything including new lines as you intended.
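
If you would rather pull the samples out directly instead of testing with matches and then splitting, something along these lines could work. It is a rough sketch, not a drop-in replacement: the class name `RegexExtractor` and the method name `extractAll` are invented for this example. It uses `Pattern.DOTALL` (the flag form of `(?s)`), a reluctant `.*?` so each match ends at the nearest closing delimiter, and `Pattern.quote` so the delimiters are treated as literal text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class RegexExtractor {

    // Returns the text between each startDelim/endDelim pair, in order of appearance.
    static List<String> extractAll(String source, String startDelim, String endDelim) {
        // Pattern.quote makes the delimiters literal; DOTALL lets '.' match line terminators.
        Pattern pattern = Pattern.compile(
                Pattern.quote(startDelim) + "(.*?)" + Pattern.quote(endDelim),
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(source);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group(1)); // group 1 is the text between the delimiters
        }
        return results;
    }

    public static void main(String[] args) {
        String source = "<html>blah<code>first</code>junk<code>second</code></html>";
        System.out.println(extractAll(source, "<code>", "</code>"));
    }
}
```

The point of the design is that the quantifier applies to a single `.` rather than to a group containing an alternation, which is the shape that shows up as the repeating Branch/GroupHead/Loop frames in the stack trace from the question.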

Simon G.