
I have a regexp to extract an id and a label from HTML source code. It can be found HERE.

As you can see, it works fine and it's fast, but when I try this regexp in Java with the same source code, it (1) takes forever and (2) only matches one string (everything from the first a to the last a is one match).

I tried it with the MULTILINE flag on and off, but it made no difference. I don't understand how a regexp can work everywhere but in Java. Any ideas?

private static final String COURSE_REGEX = "<a class=\"list-group-item list-group-item-action \" href=\"https:\\/\\/moodle-hs-ulm\\.de\\/course\\/view\\.php\\?id=([0-9]*)\"(?:.*\\s){7}<span class=\"media-body \">([^<]*)<\\/span>";

Pattern pattern = Pattern.compile(COURSE_REGEX, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(sourceCode);
List<String> courses = new ArrayList<>();

while(matcher.find() && matcher.groupCount() == 2){
    courses.add(matcher.group(1) + "(" + matcher.group(2) + ")");
}
Nimmi
    [Don't parse HTML with RegEx](https://stackoverflow.com/a/1732454/7008354)! (Hint: It is evil, as seen in the linked answer) – Tobias F. Jan 22 '19 at 10:14

1 Answer

Your regex is running into catastrophic backtracking because of the gargantuan number of ways the subexpression (?:.*\s){7} can be matched (the . can also match spaces, so the same text can be split between .* and \s in many different ways). As far as I know, Java's regex engine has no built-in step limit, so the match attempt simply keeps going, which is why it seems to take forever. Other engines may give up earlier; PCRE (used by PHP), for example, has a configurable backtrack limit.

If you simplify that part of your regex to .*?, you do get the matches:

"(?s)<a class=\"list-group-item list-group-item-action \" href=\"https://moodle-hs-ulm\\.de/course/view\\.php\\?id=([0-9]*)\".*?<span class=\"media-body \">([^<]*)</span>"

Note that you need the DOTALL flag ((?s), so . may match a newline) instead of the MULTILINE flag which changes the behavior of ^ and $ anchors (none of which your regex is using).
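To make the difference between the two flags concrete, here is a minimal sketch. The two-line input and the class and method names are made up for illustration; only Pattern.MULTILINE, Pattern.DOTALL and the inline (?s) flag come from java.util.regex.

```java
import java.util.regex.Pattern;

class DotallVsMultiline {
    // Hypothetical two-line input, made up for this illustration.
    static final String INPUT = "<a>\n<span>label</span>";

    // Returns true if the regex is found anywhere in INPUT under the given flags.
    static boolean found(String regex, int flags) {
        return Pattern.compile(regex, flags).matcher(INPUT).find();
    }

    public static void main(String[] args) {
        String regex = "<a>.*?<span>";
        System.out.println(found(regex, 0));                 // false: . stops at the newline
        System.out.println(found(regex, Pattern.MULTILINE)); // false: only ^ and $ change
        System.out.println(found(regex, Pattern.DOTALL));    // true: . may match the newline
        System.out.println(found("(?s)" + regex, 0));        // true: inline form of DOTALL
    }
}
```

Passing Pattern.DOTALL to Pattern.compile and prefixing the regex with (?s) are equivalent here; use whichever reads better in your code.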

Also note that you don't need to escape slashes in a Java regex.

This solution is not very robust because .*? is rather unspecific. I suppose your previous attempt of (?:.*\\s){7} may have been designed to match no more than 7 lines of text? In that case, you could use (?:(?!</a>).)* instead to ensure that you don't cross over into the next <a> tag. That's one of the dangers of parsing HTML with regex :)
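A small sketch of that difference, under some assumptions: the markup below is a simplified, made-up stand-in for the real page (the first link deliberately has no media-body span), and the class name and the HEAD, TAIL and extract helpers are hypothetical. It only illustrates how the two "gap" subexpressions behave; it is not a drop-in replacement for your regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedDot {
    // Hypothetical, simplified markup: the first link has no media-body span.
    static final String HTML =
        "<a href=\"view.php?id=1\">\n  <img src=\"x.png\">\n</a>\n"
      + "<a href=\"view.php?id=2\">\n  <span class=\"media-body \">Math</span>\n</a>\n";

    // Fixed parts of the regex; only the "gap" between them varies.
    static final String HEAD = "(?s)<a href=\"view\\.php\\?id=([0-9]+)\"";
    static final String TAIL = "<span class=\"media-body \">([^<]*)</span>";

    // Collects "id(label)" for every match of HEAD + gap + TAIL in HTML.
    static List<String> extract(String gap) {
        Matcher m = Pattern.compile(HEAD + gap + TAIL).matcher(HTML);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1) + "(" + m.group(2) + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // .*? runs past the first </a> and pairs id 1 with the second link's label.
        System.out.println(extract(".*?"));            // [1(Math)]
        // The lookahead stops at </a>, so the first link simply yields no match.
        System.out.println(extract("(?:(?!</a>).)*")); // [2(Math)]
    }
}
```

With .*? the wrong id ends up attached to the label, which is exactly the kind of silent error the lookahead version avoids.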

Finally, greetings from a staff member of the faculty of Informatics at your university :)

Tim Pietzcker
  • Thank you very much. Although I will follow Tobias's recommendation to use a parser, the regexp works now. I will read the article about backtracking to better understand regexps; I hope this will prevent such errors in the future :). What a coincidence that someone from my university and the same faculty answered my question. Made my day ^^ – Nimmi Jan 22 '19 at 10:41