java, regexp and simple html nested: unable to get inside text

Question

I'm seeing strange behaviour with regexp pattern matching.

The regexp is:

String regexp = "<h3.*>(.*)</h3>";

First case:

<h3 class="pubAdTitleBlock">Title</h3>

In this case all is fine: matcher.group(1) gives me 'Title'.

In the second case, there is a link nested inside the h3, like this:

<h3 class="pubAdTitleBlock "><a href="myLink" title="title">Title</a></h3>

This is the problem.

In this case:
- matcher.find() is true,
- matcher.group(0) is the full string,
- but matcher.group(1) is an empty string.

Why?

I need to extract the title from both <h3 ...>title</h3> and <h3 ...><a ...>title</a></h3>.

[Don't use a regexp to parse HTML, use an HTML parser.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Denys Séguret, Sep 11 '12 at 13:38
I can't. The rest of the text is plain text ... It's a long story, and I'd like to understand the behaviour of the regexp before possibly abandoning it! — realtebo, Sep 11 '12 at 13:41
@dystroy Almost. You are looking for [this](http://stackoverflow.com/a/1732454/647772). — , Sep 11 '12 at 13:41
@dystroy, no, I am definitely only looking for a quick and dirty solution, not for your frustration! I understand your post. I've read it in the past and it's one of the best replies in the world, but I'd like a suggestion, and in fact I obtained a solution. Sorry — realtebo, Sep 11 '12 at 13:53
@realtebo No worry. I understand your question, this was just a warning as it wasn't clear at all you weren't trying to do more. Note that nobody downvoted you ;) — Denys Séguret, Sep 11 '12 at 13:59
@dystroy: OK, it's clear! Your comment was fine, and the linked one is very, very funny! — realtebo, Sep 11 '12 at 14:00

score 4 · Answer 1 · answered Sep 11 '12 at 13:43

<h3.*> matches <h3 class="pubAdTitleBlock "><a href="myLink" title="title">Title</a> because regexp quantifiers are greedy by default: .* runs as far as the last > that still lets the rest of the pattern match. Put a question mark after the * if you want it to stop at the first >. Try this: <h3.*?>(.*)</h3>
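A minimal sketch of the lazy variant, run against both inputs from the question. The class and method names (LazyQuantifierDemo, firstGroup) are made up for illustration. Note that in the nested case group(1) is no longer empty, but it still contains the <a> markup around the title.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyQuantifierDemo {
    // ".*?" is the lazy form: it stops at the first '>' that lets the rest match.
    static final Pattern LAZY = Pattern.compile("<h3.*?>(.*)</h3>");

    /** Returns group(1) of the first match, or null when nothing matches. */
    static String firstGroup(String html) {
        Matcher m = LAZY.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String plain = "<h3 class=\"pubAdTitleBlock\">Title</h3>";
        String nested = "<h3 class=\"pubAdTitleBlock \">"
                + "<a href=\"myLink\" title=\"title\">Title</a></h3>";
        System.out.println(firstGroup(plain));   // Title
        System.out.println(firstGroup(nested));  // <a href="myLink" title="title">Title</a>
    }
}
```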

Alex Vayda

Actually, I would recommend using XPath instead of RegExp when extracting data from XML-like structures. If your HTML is not necessarily a well-formed XML document, there are tools which can convert HTML to XHTML, on which you can then run XPath expressions. – Alex Vayda Sep 11 '12 at 13:52
To use XPath I need JTidy, but I cannot use it, sorry. – realtebo Sep 11 '12 at 13:54

score 3 · Accepted Answer · answered Sep 11 '12 at 13:43

The first .* will match " class="pubAdTitleBlock "><a href="myLink" title="title">Title</a", leaving only the empty string between </a> and </h3> for the capturing group.

You'll want to change it to something like [^>]* (i.e. "anything except >").
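As a rough sketch of that suggestion: with [^>]* the opening tag can no longer swallow the nested link, so the group holds everything between <h3 ...> and </h3>. The second helper, which strips any remaining tags to leave only the text, is an extra step assumed here for illustration and is not part of the answer. The names NegatedClassDemo, innerHtml and innerText are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // "[^>]*" cannot run past the '>' that closes the opening <h3 ...> tag.
    static final Pattern H3 = Pattern.compile("<h3[^>]*>(.*)</h3>");

    /** Returns the raw content between <h3 ...> and </h3>, or null if nothing matches. */
    static String innerHtml(String html) {
        Matcher m = H3.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Assumed extra step: drop any remaining tags so only the text is left. */
    static String innerText(String html) {
        String inner = innerHtml(html);
        return inner == null ? null : inner.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String nested = "<h3 class=\"pubAdTitleBlock \">"
                + "<a href=\"myLink\" title=\"title\">Title</a></h3>";
        System.out.println(innerHtml(nested));  // <a href="myLink" title="title">Title</a>
        System.out.println(innerText(nested));  // Title
    }
}
```

This stays in quick-and-dirty territory: it only behaves for simple, single-line fragments like the ones in the question.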

skunkfrukt

Your solution is simple and clear. I voted it the best. Below there is a reply from me with the solution I found, thanks to you. – realtebo Sep 11 '12 at 13:57

score 2 · Answer 3 · edited Jun 20 '20 at 09:12

The answer to this is the "greediness" of regular expressions. Take the "greater than" character in your regex:

<h3.*>(.*)</h3>
     ^this one

You expect this to match the end of the opening h3 tag, so that your capture group contains everything inside the h3 tag, just as in the first example.

Regexes are greedy though, meaning they try to consume as much of the text as possible. As a result, the first part of your regex, which is

<h3.*>

matches this whole section:

<h3 class="pubAdTitleBlock "><a href="myLink" title="title">Title</a>

Note that the matched string ends with the same character as your regex (>). The group now captures the remaining text between this > and the </h3>, which is an empty string.
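One way to check this for yourself is to wrap the first part of the regex in its own capturing group and print both groups. The extra group and the names GreedyDemo and groups are added here only for illustration; the matching behaviour is otherwise that of the original pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Original pattern, with "<h3.*>" wrapped in a group so we can see what it consumed.
    static final Pattern GREEDY = Pattern.compile("(<h3.*>)(.*)</h3>");

    /** Returns {group(1), group(2)} of the first match, or null when nothing matches. */
    static String[] groups(String html) {
        Matcher m = GREEDY.matcher(html);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String nested = "<h3 class=\"pubAdTitleBlock \">"
                + "<a href=\"myLink\" title=\"title\">Title</a></h3>";
        String[] g = groups(nested);
        // [<h3 class="pubAdTitleBlock "><a href="myLink" title="title">Title</a>]
        System.out.println("[" + g[0] + "]");
        // []
        System.out.println("[" + g[1] + "]");
    }
}
```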

There are 3 solutions that fit:

1. Use an XML parser and then use XPath to get the content of the h3 tag (a lot of overhead because of external libraries etc., but an absolute must-have for bigger projects).
2. Make the *-operator non-greedy by appending a ?, making the regex <h3.*?>(.*)</h3>.
3. Modify the regex to explicitly start capturing as soon as the h3 tag (and no other tag!) closes, by making it: <h3[^>]*>(.*)</h3>
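For the first option, here is a hedged sketch that uses only the XML and XPath classes bundled with the JDK. It assumes the input is a well-formed XML fragment, which real-world HTML often is not (that is where a converter or external library would come in). The names XPathTitleDemo and h3Text are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathTitleDemo {
    /** Parses a well-formed fragment and returns the text content of the first h3. */
    static String h3Text(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // string(...) yields the concatenated text of the first h3, nested tags included.
            return XPathFactory.newInstance().newXPath().evaluate("string(//h3)", doc);
        } catch (Exception e) {
            // Assumption: callers treat unparsable input as a programming error.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String plain = "<h3 class=\"pubAdTitleBlock\">Title</h3>";
        String nested = "<h3 class=\"pubAdTitleBlock \">"
                + "<a href=\"myLink\" title=\"title\">Title</a></h3>";
        System.out.println(h3Text(plain));   // Title
        System.out.println(h3Text(nested));  // Title
    }
}
```

Unlike the two regex options, this returns only the text in both cases, because the parser already knows where each tag begins and ends.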

Hope this helps!

Yes, you explained a lot to me, thanks. Sorry, I cannot install an external library in this case (it's a long story...) — realtebo, Sep 11 '12 at 13:55

score 0 · Answer 4 · answered Sep 11 '12 at 13:50

Thanks to Namida Aneskans, the solution was:

String regexp = "<h3[^>]*>(<a[^>]*>)?([^<]+)(</a>)?</h3>";

So the first and the third group can be unmatched (matcher.group() returns null for them when there is no link), but the second is always the title, thanks!
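A small sketch of how this pattern behaves on the two inputs from the question. The helper names (OptionalLinkDemo, title, linkTag) are made up for illustration. It assumes the title itself contains no < character, since the second group is [^<]+.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalLinkDemo {
    static final Pattern H3_TITLE =
            Pattern.compile("<h3[^>]*>(<a[^>]*>)?([^<]+)(</a>)?</h3>");

    /** Returns the title (group 2), or null when the input does not match. */
    static String title(String html) {
        Matcher m = H3_TITLE.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    /** Returns the opening <a ...> tag (group 1), or null when there is no link. */
    static String linkTag(String html) {
        Matcher m = H3_TITLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String plain = "<h3 class=\"pubAdTitleBlock\">Title</h3>";
        String nested = "<h3 class=\"pubAdTitleBlock \">"
                + "<a href=\"myLink\" title=\"title\">Title</a></h3>";
        System.out.println(title(plain));     // Title
        System.out.println(title(nested));    // Title
        System.out.println(linkTag(plain));   // null
        System.out.println(linkTag(nested));  // <a href="myLink" title="title">
    }
}
```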

realtebo