I'm trying to parse some HTML in Java using a regex. It works on smaller test samples, but when I run it against the live page the regex engine overflows the stack.
Here is the code and the RegEx I'm using.
/**
 * RegEx Explanation:
 * "(?i)" - Turn on case insensitive mode
 * "<BR><BR><B>.+?</B><BR>" - Match the format for a group name
 * "(?-i)" - Turn off case insensitive mode
 * "(.|\\r|\\n)" - Match all the text following the group name incl. newlines
 * "(?=((?i)<BR><BR><B>.+?</B><BR>(?-i))" - and lookahead for the start of a new group, make the match lazy and use case-insensitive mode
 * "+?)" - Make the lookahead lazy, close out the capture group.
 */
Pattern filterPattern =
    Pattern.compile("(?i)(<BR><BR><B>.+?</B><BR>)(?-i)(.|\\r|\\n)+?(?=((?i)<BR><BR><B>.+?</B><BR>(?-i))+?)");
Matcher match = filterPattern.matcher(content);
ArrayList<String> groups = new ArrayList<String>();
// Retrieve the matches found by the RegEx
while (match.find()) {
    if (match.groupCount() > -1) {
        groups.add(match.group(0));
    }
}
The live HTML is a board list (http://menu.2ch.net/bbsmenu.html); its general format is:
<br><br><b>Group name</b><br>
<a href="board url">Name of the board</a><br>
This block is repeated a number of times, each with a varying number of links. I avoided a full HTML parser such as jsoup simply because the format is consistent and seemed easier to target with a regex in a first pass that extracts the sections.
The stack overflow occurs when I call group(). Other questions stated that this happens because the group() call in Java has no bound on recursion depth, so it runs until it hits the stack limit. I'm not very good with regex, so I may be missing an easier expression. I suspect the recursion comes from the alternation (.|\r|\n), but it could just as easily come from having too many groups. I don't know.
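For illustration, here is a minimal sketch of one alternative, assuming the alternation really is the source of the recursion. It drops the (.|\r|\n) group in favour of a plain lazy dot under Pattern.DOTALL, which, as far as I understand, java.util.regex evaluates in a loop rather than recursing once per matched character. The class name GroupExtractor and the method extractGroups are hypothetical names chosen for the example. Two details differ from the pattern above: the group name is matched with [^<]+ instead of .+?, and the lookahead also accepts end of input (\z) so the last group is not dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original code.
class GroupExtractor {

    // CASE_INSENSITIVE replaces the inline (?i)/(?-i) toggles;
    // DOTALL lets '.' match line terminators, so no (.|\r|\n) group is needed.
    private static final Pattern GROUP_PATTERN = Pattern.compile(
            "<br><br><b>[^<]+</b><br>"    // group header; assumes the name contains no '<'
            + ".*?"                       // lazily consume the links that follow
            + "(?=<br><br><b>|\\z)",      // stop before the next header or at end of input
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one string per group, header included.
    static List<String> extractGroups(String content) {
        List<String> groups = new ArrayList<>();
        Matcher matcher = GROUP_PATTERN.matcher(content);
        while (matcher.find()) {
            groups.add(matcher.group());
        }
        return groups;
    }
}
```

Calling GroupExtractor.extractGroups(content) in place of the loop above should yield the same kind of list. I have not run this against the live page, so treat it as a starting point rather than a verified fix.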
Is there a better expression that avoids the catastrophic recursion?