1

I am trying to extract content from a webpage that has the following markup:

<div id="div">
    content
    content
    content
    content
</div>

The regex I currently have is:

Pattern div = Pattern.compile("<div id=\"div\">(.*?)</div>");

This works when the content is on a single line, but when the div spans multiple lines the pattern no longer matches anything inside the div tag.

Any help would be appreciated (I am using Java, by the way).

nhahtdh
  • 1
    Enable multiline mode, pattern.compile(" – milan Jan 18 '12 at 22:17
  • 1
    @milan, no, [MULTILINE](http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#MULTILINE) isn't needed, only [DOTALL](http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#DOTALL). – Bart Kiers Jan 18 '12 at 22:20
  • I wasn't sure, I wrote that off the top of my head, thanks – milan Jan 18 '12 at 22:28

4 Answers

3

Personally, I would strongly discourage you from using regular expressions in this case. It is well documented that extracting information from an HTML document with regular expressions is a bad idea. Take a look at a proper HTML parser instead!
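
As a rough illustration of the parser approach, here is a minimal sketch that uses only the XML parser and XPath support bundled with the JDK. It assumes the fragment is well-formed XML, which holds for the markup in the question but often not for real-world HTML, where a dedicated HTML parser would be the safer choice. The class name `DivExtractor` and the helper `extractDivText` are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DivExtractor {
    // Hypothetical helper: returns the text inside the element with id="div",
    // or an empty string when no such element exists.
    // Assumption: the markup is well-formed XML.
    static String extractDivText(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            // evaluate(String, Object) returns the string value of the first matching node
            return XPathFactory.newInstance().newXPath()
                    .evaluate("//div[@id='div']", doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<div id=\"div\">\n    content\n    content\n</div>";
        // Line breaks inside the div need no special handling here
        System.out.println(extractDivText(markup).trim());
    }
}
```

The parser treats line breaks inside the element as ordinary text, so no regex flags are involved.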

Community
ninesided
  • 1
    I think [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) is the definitive answer about parsing HTML with regular expressions. – Keith Thompson Jan 18 '12 at 22:22
  • @KeithThompson - aah there it is, I couldn't find it! Will add to my answer, thank you. – ninesided Jan 18 '12 at 22:23
1

I think this should work (you need to add the DOTALL modifier):

Pattern div = Pattern.compile("<div id=\"div\">(.*?)</div>", Pattern.DOTALL);
Bart Kiers
1

It doesn't work when there are line breaks because, by default, . (DOT) does not match any line break character. To let . match line breaks as well, do:

Pattern.compile("<div id=\"div\">(.*?)</div>", Pattern.DOTALL)

or:

Pattern.compile("<div id=\"div\">([\\s\\S]*?)</div>")

or:

Pattern.compile("(?s)<div id=\"div\">(.*?)</div>")

See: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#DOTALL
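
To make the difference concrete, here is a small sketch that runs the original pattern and the three variants above against a multi-line div. The class name `DotAllDemo`, the constant `HTML`, and the helper `firstGroup` are hypothetical names introduced for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Sample input modelled on the markup in the question
    static final String HTML = "<div id=\"div\">\n    content\n    content\n</div>";

    // Hypothetical helper: returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Default flags: . stops at the first line break, so there is no match
        System.out.println(firstGroup("<div id=\"div\">(.*?)</div>", 0, HTML) == null);

        // The three variants all capture the same multi-line content
        String viaFlag = firstGroup("<div id=\"div\">(.*?)</div>", Pattern.DOTALL, HTML);
        String viaClass = firstGroup("<div id=\"div\">([\\s\\S]*?)</div>", 0, HTML);
        String viaInline = firstGroup("(?s)<div id=\"div\">(.*?)</div>", 0, HTML);
        System.out.println(viaFlag.equals(viaClass) && viaClass.equals(viaInline));
    }
}
```

The captured group still contains the surrounding whitespace and line breaks, so calling `trim()` on it may be useful.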

Bart Kiers
-1

You could add the Pattern.MULTILINE option

Pattern div = Pattern.compile("<div id=\"div\">(.*?)</div>", Pattern.MULTILINE);

or add the (?m) flag to your regex (at the end)

Hope this helps
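
For reference, and in line with the comment under the question, here is a short sketch of what the two flags actually change in `java.util.regex`: MULTILINE affects where `^` and `$` match, while DOTALL affects what `.` matches. The class name `MultilineVsDotAll` and the helper `found` are hypothetical names for this example.

```java
import java.util.regex.Pattern;

class MultilineVsDotAll {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "first\nsecond";

        // MULTILINE lets ^ match at the start of every line, not just the input
        System.out.println(found("^second", 0, text));                 // false
        System.out.println(found("^second", Pattern.MULTILINE, text)); // true

        // MULTILINE does not change the dot; only DOTALL lets . match a line break
        System.out.println(found("first.second", Pattern.MULTILINE, text)); // false
        System.out.println(found("first.second", Pattern.DOTALL, text));    // true
    }
}
```

An embedded flag such as `(?m)` or `(?s)` applies from the point where it appears in the pattern, which is why it is normally written at the start.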

legrandviking