0

I'm currently attempting to parse some HTML in Java using RegEx. On smaller test samples of the live code it works, but when trying it against the live code the regex engine breaks the stack.

Here is the code and the RegEx I'm using.

/**
 *  RegEx Explanation:  
 *      "(?i)" - Turn on case insensitive mode
 *      "<BR><BR><B>.+?</B><BR>" - Match the format for a group name
 *      "(?-i)" - Turn off case insensitive mode
 *      "(.|\\r|\\n)" - Match all the text following the group name incl. newlines 
 *      "(?=((?i)<BR><BR><B>.+?</B><BR>(?-i))" - and lookahead for the start of a new group, make the match lazy and use case-insensitive mode
 *      "+?)" - Make the lookahead lazy, close out the capture group.
 */
Pattern filterPattern = 
   Pattern.compile("(?i)(<BR><BR><B>.+?</B><BR>)(?-i)(.|\\r|\\n)+?(?=((?i)<BR><BR><B>.+?</B><BR>(?-i))+?)");
   Matcher match = filterPattern.matcher(content);

   ArrayList<String> groups = new ArrayList<String>();
   // Retrieve the matches found by the RegEx
   while(match.find()) {
     if(match.groupCount() > -1) {
        groups.add(match.group(0));
     }
   }

The live html is a board list (http://menu.2ch.net/bbsmenu.html), but the general format is:

<br><br><b>Group name</b><br>
<a href="board url">Name of the board</a><br>

This is repeated a number of times with varying number of links. I avoided using a regular HTML parser like JSoup simply because the format was consistent and easier to target with RegEx in a first pass to extract the sections.

The stack overflow occurs when I call group(). Other questions stated that this is due to the group() call in Java having no bounds on recursion depth so it'll run till it hits the stack limit. I'm not very good at RegEx, which may be why I'm missing a potentially easier expression. I suspect the recursion problem is occurring at the alternation (.|\r|\n), but it could be just as easily occurring due to too many groups. I don't know.

Is there a better expression to avoid the catastrophic recurssion?

Jeremy
  • 119
  • 3
  • 11
  • 3
    [Uh oh...](http://stackoverflow.com/a/1732454/129570) – Oliver Charlesworth Jan 05 '14 at 12:20
  • It is widely considered that though RegEx is powerful, parsing HTML requires an HTML parser. See [this answer](http://stackoverflow.com/a/1732454/418556) for an amusing expression of that thought.. OK - what @OliCharlesworth said.. – Andrew Thompson Jan 05 '14 at 12:20
  • 4
    More seriously, the fact that you have this horrific regex that *still* doesn't work ought to demonstrate that using a proper parser is the correct approach here (or at least, that regex is not the best approach). – Oliver Charlesworth Jan 05 '14 at 12:22
  • As I commented in TwisterRob's post. This page has nothing that would make using an html parser or XPath's easier. There isn't organization in the page. No ids, classes, divs, or spans. It's 100s of siblings in a single body tag divided by simple text. – Jeremy Jan 05 '14 at 13:30
  • That is very true, I just checked, I couldn't write a JQuery selector to match the links "below" the categories. However the categories are possible to match via `$("br + br + b")`, where `+` is the `following-sibling` XPath axis. You would need some trick to emulate a "between" selector. – TWiStErRob Jan 05 '14 at 13:43
  • On the other hand if you have the DOM in memory, you can just find a category `x` and then do something like: `while(!isCategory(x.next())) {link = x.next();}`, you can move around logically in the DOM tree using any Java construct, I suggest you give it a try. More powerful than any "selector" language. – TWiStErRob Jan 05 '14 at 13:46

1 Answers1

2

As they said in the comments, use a parser where you can access the DOM via XPath or similar!

To help you with regex, consider (none of these are tested, just from memory):

(?is)(?:<BR><BR><B>(.+?)</B><BR>)(.*?)(?=<BR><BR><B>.+?</B><BR>)

or

(?i)(?:<BR><BR><B>(.+?)</B><BR>)\s*((?:<A HREF=.*?>.*?</A><BR>\s*)+)
  • ((?i)blah(?-i)) is hard to read, use (?i:blah) instead, that's how it is designed, or use just one (?i), because you don't want any specific case-sensitivity in your regex anyway
  • If you turn on DOTALL mode, you can write .*? and it'll match "hello\n\r\tworld" as well.
  • Regarding (.|\\r|\\n)+? and the lookahead: you don't need to repeat the next heading, you're just looking for the next one's existense
  • You can use group(1) and group(2) to find heading and links respectively if you use non-capturing groups: (?:I'm not a numbered group).
  • If the format was consistent, you could express it instead of using (.|\\r|\\n)+?: <A HREF=(.*?)>(.*?)</A><br>\n
  • The first option I gave may break at the last Category, but I think so does yours, because you're using +?, maybe a (?=...)? to make it an optional match, but I'm not sure it will work. Use the second option, that's more to the point!
nhahtdh
  • 55,989
  • 15
  • 126
  • 162
TWiStErRob
  • 44,762
  • 26
  • 170
  • 254
  • 1
    The reason why the HTML parser isn't a simpler choice is because there is no organization in the page. There are no id's. There are no classes. No spans, no divs, nothing to order the actual page. Its simply raw text and links in a body tag. I've tried this once before using a parser but it turned into a giant state machine. I'll try your solution out. Thanks. – Jeremy Jan 05 '14 at 13:28
  • I tried out both methods and they work. I appreciate that you answered the issue with the RegEx even if everyone is appalled at the idea of attempting to apply it to HTML. I will note that I DO use the JSoup parser on the HTML matched by the regex to exract the group names and link info. Thanks for pointing out the DOTALL mode too, I wasn't aware of that. – Jeremy Jan 05 '14 at 13:46
  • I've been there... that's why I answered. I tried to use an XML parser, but it failed, so I used regex to fix the HTML into an XML (kind of a tidy) and then parse, it wasn't nice. (non-production of course, just hacking around :) – TWiStErRob Jan 05 '14 at 13:50