0

solution: this works:

String p="<pre>[\\\\w\\\\W]*</pre>";

I want to match and capture the content enclosed by the <pre></pre> tag. I tried the following, but it is not working. What's wrong?

String p="<pre>.*</pre>";

        Matcher m=Pattern.compile(p,Pattern.MULTILINE|Pattern.CASE_INSENSITIVE).matcher(input);
        if(m.find()){
            String g=m.group(0);
            System.out.println("g is "+g);
        }
user121196
  • 2
    Seriously, you shouldn't be parsing HTML with regular expressions. Use a library such as [TagSoup](http://mercury.ccil.org/~cowan/XML/tagsoup/) instead. – Joey May 08 '10 at 00:20
  • here we go again ... did you try a search? how about this guidance - http://stackoverflow.com/questions/2400623/if-youre-not-supposed-to-use-regular-expressions-to-parse-html-then-how-are-htm – Bert F May 08 '10 at 00:25
  • 1
    `[\\\\w\\\\W]` will match a backslash, `w` or `W`. You probably meant `[\\w\\W]`, but you don't need to do that. Just use the DOTALL flag, as I said in my answer. That other trick is used a lot in JavaScript because JS has no equivalent for the DOTALL flag. – Alan Moore May 08 '10 at 01:10
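
To illustrate the comment above, here is a small sketch comparing the two character classes. The class name `CharClassDemo` and the helper `matchesPre` are hypothetical, introduced only for this example. In a Java string literal, `"[\\\\w\\\\W]"` compiles to a class containing a literal backslash, `w`, and `W`, while `"[\\w\\W]"` compiles to a class that matches any character, line terminators included.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Hypothetical helper: true if the whole input matches <pre>CLASS*</pre>.
    public static boolean matchesPre(String charClass, String input) {
        return Pattern.matches("<pre>" + charClass + "*</pre>", input);
    }

    public static void main(String[] args) {
        String html = "<pre>a\nb</pre>";
        // Over-escaped: only a backslash, 'w', or 'W' may appear between the tags.
        System.out.println(matchesPre("[\\\\w\\\\W]", html));
        // Intended form: \w or \W, so any character, newlines included.
        System.out.println(matchesPre("[\\w\\W]", html));
        // The over-escaped class does accept this unusual input.
        System.out.println(matchesPre("[\\\\w\\\\W]", "<pre>wW\\</pre>"));
    }
}
```

As the comment notes, the `[\\w\\W]` trick is not needed in Java, since the DOTALL flag achieves the same effect with a plain `.`.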

3 Answers

4

Regex is in fact not the right tool for this. Use a parser. Jsoup is a nice one.

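import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Parse the HTML string and print the text of every <pre> element.
Document document = Jsoup.parse(html);
for (Element element : document.getElementsByTag("pre")) {
    System.out.println(element.text());
}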

By the way, Jsoup.parse() also has overloads that take a URL or a File.


I recommend Jsoup because it is the least verbose of all the HTML parsers I have tried. It not only provides JavaScript-like methods that return elements implementing Iterable, but also supports jQuery-like selectors, which was a big plus for me.

BalusC
3

You want the DOTALL flag, not MULTILINE. MULTILINE changes the behavior of ^ and $, while DOTALL is the one that lets . match line terminators. You probably want to use a reluctant quantifier, too:

String p = "<pre>.*?</pre>";
Alan Moore
  • 1
    If there's more than one `<pre>` element, a greedy `.*` will match from the first opening `<pre>` to the last closing `</pre>`. The reluctant (or non-greedy) `.*?` will stop at the first closing tag.
    – Alan Moore May 08 '10 at 01:03
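
Putting the DOTALL flag and the reluctant quantifier together, a minimal sketch might look like the following. It assumes simple, non-nested `<pre>` tags without attributes. The class name `PreExtractor` and the method `extractPre` are made up for illustration, and the capture group is an addition so that only the text between the tags is returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreExtractor {
    // DOTALL lets '.' match line terminators; '.*?' stops at the first closing tag.
    private static final Pattern PRE = Pattern.compile(
            "<pre>(.*?)</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the content of every <pre> element, in order.
    public static List<String> extractPre(String input) {
        List<String> contents = new ArrayList<>();
        Matcher m = PRE.matcher(input);
        while (m.find()) {
            contents.add(m.group(1)); // group 1 is the text between the tags
        }
        return contents;
    }

    public static void main(String[] args) {
        String input = "<pre>line one\nline two</pre> text <PRE>second</PRE>";
        System.out.println(extractPre(input));
    }
}
```

With a greedy `.*` in place of `.*?`, the same input would yield a single match running from the first opening tag to the last closing tag. For real-world HTML, a parser remains the more robust choice.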
1
String stringToSearch = "H1 FOUR H1 SCORE AND SEVEN YEARS AGO OUR FATHER...";

// the case-insensitive pattern we want to search for
Pattern p = Pattern.compile("H1", Pattern.CASE_INSENSITIVE);

Matcher m = p.matcher(stringToSearch);

// see if we found a match
int count = 0;
while (m.find())
    count++;

System.out.println("H1 : "+count);   
Aakash Goplani