The following code prints nothing. What am I doing wrong? The regex tester myregexp says that the regular expression is correct.

String page = "<div id=\"foo\" class=\"foo\" style=\"background-image: url(foo.jpg); width: 320px; height: 245px\">\n" +
                    "  <a href=\"foo\" onclick=\"return bar('foo', 'foo', {foo: bar, foo: bar}, foo)\"></a>\n" +
                    "</div>";

Pattern pattern = Pattern.compile("<div.*?</div>");
Matcher matcher = pattern.matcher(page);
while (matcher.find()) {
    System.out.println(matcher.start() + " " + matcher.end());
}
  • Consider using jsoup for parsing HTML: https://jsoup.org/ – Frederic Klein Nov 17 '16 at 07:31
  • [Don't parse HTML using regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags); this is NOT the right tool for the job. As for your question, it probably doesn't work because the input is multiline. – Nir Alfasi Nov 17 '16 at 07:32
  • Thanks for the advice. I'm already using jsoup, but my input HTML has some incorrect structure, so jsoup doesn't work either. – Alexander Vtyurin Nov 17 '16 at 07:33
  • alfasin, the HTML is multiline, so what? I thought .*? would do the job. – Alexander Vtyurin Nov 17 '16 at 07:34
  • This is a duplicate of [Match multiline text using regular expression](http://stackoverflow.com/questions/3651725/match-multiline-text-using-regular-expression). – Wiktor Stribiżew Nov 17 '16 at 08:01

1 Answer

By default, . in a regex does not match line terminators. This means that your regex can never reach the </div>, because . cannot match the newlines that come before it.

You should replace your compile line with:

Pattern pattern = Pattern.compile("<div.*?</div>", Pattern.DOTALL);
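
To see the difference the flag makes, here is a minimal sketch that runs the same regex with and without DOTALL. The class name DotAllDemo, the helper findSpans, and the shortened page string are made up for illustration and are not from the question. The last call uses the embedded flag (?s), which turns on the same mode from inside the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Hypothetical helper: collects "start end" for every match of regex in input.
    static List<String> findSpans(String regex, int flags, String input) {
        List<String> spans = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        while (matcher.find()) {
            spans.add(matcher.start() + " " + matcher.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the page in the question (an assumption).
        String page = "<div id=\"foo\">\n  <a href=\"foo\"></a>\n</div>";

        // No flags: . stops at the first newline, so nothing is found.
        System.out.println(findSpans("<div.*?</div>", 0, page));

        // DOTALL: . also matches newlines, so the whole <div> is one match.
        System.out.println(findSpans("<div.*?</div>", Pattern.DOTALL, page));

        // Embedded flag (?s): same effect as passing Pattern.DOTALL.
        System.out.println(findSpans("(?s)<div.*?</div>", 0, page));
    }
}
```

The first call prints an empty list; the other two each print a single span covering the whole string. Passing Pattern.DOTALL and prefixing the pattern with (?s) are interchangeable here, so pick whichever reads better in your code.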

But as was noted in the comments, except in simple cases where you have control over the structure of the HTML (no comments, no JavaScript, etc.), you should parse HTML with an HTML parser like jsoup, not with a regex.

RealSkeptic