RegExp works in JS and PHP but not in Java

Question

I have a regexp to extract an id and a label from HTML source code. It can be found HERE.

As you can see, it works fine and is fast. But when I try this regexp in Java with the same source code, it 1. takes forever and 2. only matches one string (from the first a to the last a is one match).

I tried it with the MULTILINE flag on and off, but it made no difference. I don't understand how a regexp can work everywhere but in Java. Any ideas?

private static final String COURSE_REGEX = "<a class=\"list-group-item list-group-item-action \" href=\"https:\\/\\/moodle-hs-ulm\\.de\\/course\\/view\\.php\\?id=([0-9]*)\"(?:.*\\s){7}<span class=\"media-body \">([^<]*)<\\/span>";

Pattern pattern = Pattern.compile(COURSE_REGEX, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(sourceCode);
List<String> courses = new ArrayList<>();

while(matcher.find() && matcher.groupCount() == 2){
    courses.add(matcher.group(1) + "(" + matcher.group(2) + ")");
}

[Don't parse HTML with RegEx](https://stackoverflow.com/a/1732454/7008354)! (Hint: It is evil, as seen in the linked answer) — Tobias F., Jan 22 '19 at 10:14

score 2 · Accepted Answer · answered Jan 22 '19 at 10:34

Your regex is running into catastrophic backtracking because of the gargantuan number of possible permutations the subexpression (?:.*\s){7} needs to check (because the . can also match spaces). Java aborts the match attempt after a certain number of steps (not sure how many, certainly > 1.000.000). PHP or JS may not be so cautious.

If you simplify that part of your regex to .*?, you do get the matches:

"(?s)<a class=\"list-group-item list-group-item-action \" href=\"https://moodle-hs-ulm\\.de/course/view\\.php\\?id=([0-9]*)\".*?<span class=\"media-body \">([^<]*)</span>"

Note that you need the DOTALL flag ((?s), so . may match a newline) instead of the MULTILINE flag, which changes the behavior of the ^ and $ anchors (neither of which your regex uses).

Also note that you don't need to escape slashes in a Java regex.
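To make the flag difference concrete, here is a minimal sketch that runs the simplified regex once with DOTALL and once with MULTILINE. The class name CourseExtractor, the helper methods, and the sample HTML (ids 101 and 202 with their labels) are made up for illustration; your real page will look different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CourseExtractor {
    // Simplified regex from above, without the inline (?s) so that the
    // flag can be passed explicitly. Slashes are not escaped.
    static final String COURSE_REGEX =
        "<a class=\"list-group-item list-group-item-action \" "
        + "href=\"https://moodle-hs-ulm\\.de/course/view\\.php\\?id=([0-9]*)\""
        + ".*?<span class=\"media-body \">([^<]*)</span>";

    // Hypothetical sample input: two course links spread over several lines.
    static String sampleHtml() {
        return "<a class=\"list-group-item list-group-item-action \" "
            + "href=\"https://moodle-hs-ulm.de/course/view.php?id=101\">\n"
            + "  <div>\n"
            + "    <span class=\"media-body \">Algorithms</span>\n"
            + "  </div>\n"
            + "</a>\n"
            + "<a class=\"list-group-item list-group-item-action \" "
            + "href=\"https://moodle-hs-ulm.de/course/view.php?id=202\">\n"
            + "  <span class=\"media-body \">Databases</span>\n"
            + "</a>\n";
    }

    // Collects "id(label)" for every match, compiled with the given flags.
    static List<String> extract(String sourceCode, int flags) {
        Matcher matcher = Pattern.compile(COURSE_REGEX, flags).matcher(sourceCode);
        List<String> courses = new ArrayList<>();
        while (matcher.find()) {
            courses.add(matcher.group(1) + "(" + matcher.group(2) + ")");
        }
        return courses;
    }

    public static void main(String[] args) {
        String html = sampleHtml();
        // DOTALL lets . cross line breaks, so both links are found.
        System.out.println(extract(html, Pattern.DOTALL));    // prints [101(Algorithms), 202(Databases)]
        // MULTILINE only affects ^ and $, so .*? stops at the first newline.
        System.out.println(extract(html, Pattern.MULTILINE)); // prints []
    }
}
```

The groupCount() check from the question is left out of the loop: it reports the number of groups in the pattern, not in the current match, so it is always 2 here.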

This solution is not very robust because .*? is rather unspecific. I suppose your previous attempt of (?:.*\\s){7} may have been designed to match no more than 7 lines of text? In that case, you could use (?:(?!</a>).)* instead to ensure that you don't cross over into the next <a> tag. That's one of the dangers of parsing HTML with regex :)
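Here is a small sketch of that difference, assuming a page where the first link has no media-body span at all. The class name TemperedDotDemo, the helpers, and the sample HTML are invented for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedDotDemo {
    static final String PREFIX =
        "(?s)<a class=\"list-group-item list-group-item-action \" "
        + "href=\"https://moodle-hs-ulm\\.de/course/view\\.php\\?id=([0-9]*)\"";
    static final String SUFFIX = "<span class=\"media-body \">([^<]*)</span>";

    // Variant 1: unspecific lazy dot, may run past </a> into the next link.
    static final String LAZY_REGEX = PREFIX + ".*?" + SUFFIX;
    // Variant 2: each character is consumed only if </a> does not start there.
    static final String TEMPERED_REGEX = PREFIX + "(?:(?!</a>).)*" + SUFFIX;

    // Hypothetical input: the first link lacks the media-body span.
    static String sampleHtml() {
        return "<a class=\"list-group-item list-group-item-action \" "
            + "href=\"https://moodle-hs-ulm.de/course/view.php?id=101\">\n"
            + "  <span class=\"other\">no label here</span>\n"
            + "</a>\n"
            + "<a class=\"list-group-item list-group-item-action \" "
            + "href=\"https://moodle-hs-ulm.de/course/view.php?id=202\">\n"
            + "  <span class=\"media-body \">Databases</span>\n"
            + "</a>\n";
    }

    // Collects "id(label)" for every match of the given regex.
    static List<String> extract(String regex, String sourceCode) {
        Matcher matcher = Pattern.compile(regex).matcher(sourceCode);
        List<String> courses = new ArrayList<>();
        while (matcher.find()) {
            courses.add(matcher.group(1) + "(" + matcher.group(2) + ")");
        }
        return courses;
    }

    public static void main(String[] args) {
        String html = sampleHtml();
        // The lazy dot pairs the first id with the second link's label.
        System.out.println(extract(LAZY_REGEX, html));     // prints [101(Databases)]
        // The tempered dot rejects the first link and matches only the second.
        System.out.println(extract(TEMPERED_REGEX, html)); // prints [202(Databases)]
    }
}
```

The lookahead is checked before every character, so this variant does more work per character than a plain .*?, but it can never leave the current <a> element.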

Finally, greetings from a staff member of the faculty of Informatics at your university :)

Thank you very much. Although I will follow the recommendation Tobias gave me to use a parser, the regexp works now. I will read the article about backtracking to better understand regexps; I hope this will prevent such errors in the future :). What a coincidence that someone from my university and the same faculty answered my question. Made my day ^^ — Nimmi, Jan 22 '19 at 10:41
