regex match question

Question

I want to match any of these cases with a regex. I have the header text, and I need to match it against the HTML it may appear in:

<h1>header title</h1>
<h2>site | header title</h2>
<h3 class="header">header title</h3>
<h2>header title 23 jan 2009</h2>
<h1>header title</h1>

I have this:

/(<(h1|h2|h3))(.+?)".$title."(.+?)(<\/\\2>)/i

But it doesn't always work, and I don't see why.

Thanks

You'd better give up on regexes to parse HTML. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags (for example) — Manrico Corazzi, Feb 12 '10 at 14:52
What language? .NET? Java? JavaScript? PERL? Different languages have different RegEx formats, so we need to know. — Oded, Feb 12 '10 at 14:52
It's PHP's preg_match. I don't want to walk through the DOM since that would cause too much load (I tried that). — Yvo, Feb 12 '10 at 15:12
I seem to be getting better results with a small tweak: /(<(h1|h2|h3))(.+)?".$title."(.+)?(<\/\\2>)/i God knows why :p — Yvo, Feb 12 '10 at 15:32
Here's the direct link to the famed answer Manrico is referencing: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Hank Gay, Feb 12 '10 at 16:51

score 4 · Answer 1 · answered Feb 12 '10 at 14:54

Don't use regexes to parse HTML! Use an HTML parser, instead.

Hank Gay


score 0 · Answer 2 · answered Feb 12 '10 at 14:57

Is $title regex-escaped (so characters like {, [ etc. are escaped)?
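To illustrate the escaping point, here is a minimal PHP sketch that reuses the pattern from the question. The sample HTML, the sample title and the helper name `buildPattern` are made up for the demonstration; `preg_quote()` is PHP's built-in function for escaping regex metacharacters.

```php
<?php
// Made-up sample input; the title contains regex metacharacters.
$html  = '<h2>site | header title (draft) today</h2>';
$title = 'header title (draft)';

// Hypothetical helper: the pattern from the question around a title fragment.
function buildPattern(string $titlePart): string
{
    return '/(<(h1|h2|h3))(.+?)' . $titlePart . '(.+?)(<\/\2>)/i';
}

// Unescaped: "(draft)" is read as a capture group, so the literal
// parentheses in the HTML are never matched. Result: 0.
$raw = preg_match(buildPattern($title), $html);

// Escaped: preg_quote() backslash-escapes the metacharacters; the second
// argument also escapes the "/" delimiter. Result: 1.
$quoted = preg_match(buildPattern(preg_quote($title, '/')), $html);
```

If the titles can contain characters such as `(`, `[`, `?` or `+`, escaping them this way is probably the first thing to try.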

Line endings may be a problem too; you may need something like multiline support, if your regex implementation supports it.
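In PHP's PCRE flavour, the modifier that matters here is `s` (dot-all) rather than `m`, because by default `.` does not match a newline. A small sketch, with made-up input in which the header text sits on its own line:

```php
<?php
// Made-up input: the title is separated from the tags by newlines.
$html  = "<h2>\n  header title\n</h2>";
$title = preg_quote('header title', '/');

// Without /s, "." stops at "\n", so the lazy wildcard cannot reach the title.
$withoutS = preg_match('/<(h[1-3])[^>]*>.*?' . $title . '.*?<\/\1>/i', $html);

// With /s (PCRE_DOTALL), "." also matches newlines and the header is found.
$withS = preg_match('/<(h[1-3])[^>]*>.*?' . $title . '.*?<\/\1>/is', $html);
```

Here `$withoutS` is 0 and `$withS` is 1. Whether this applies depends on whether the real pages ever break a header across lines.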

It is better to process structured data with appropriate tools: XML with an XML parser, HTML with an HTML parser. There are parsers like BeautifulSoup in Python, Hpricot in Ruby, libxml2...
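Since the question turned out to be about PHP, a parser-based version could look roughly like the sketch below. It assumes the bundled DOM extension (`DOMDocument`, `DOMXPath`) is available; the function name `findHeaders` is hypothetical. It does not address the load concern raised in the comments, so it is only a starting point for comparison.

```php
<?php
// Hypothetical helper: returns the tag names of all h1-h3 elements
// whose text contains $title (case-insensitive).
function findHeaders(string $html, string $title): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them;
    // real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $found = [];
    foreach ($xpath->query('//h1 | //h2 | //h3') as $node) {
        // textContent is the element's text with nested markup removed.
        if (stripos($node->textContent, $title) !== false) {
            $found[] = $node->nodeName;
        }
    }
    return $found;
}
```

Because the comparison is made on `textContent`, a header such as `<h1>header <em>title</em></h1>` is still found, which a simple regex would miss.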

score 0 · Answer 3 · answered Feb 12 '10 at 15:01

What you (logically) want for your example is something like:

<(group of anything not including ">")> (value to extract) <(group of anything not including ">")>

e.g.

<[^>]+>([^<]+)<[^>]+>

The specific regex syntax depends somewhat on the environment you're working in.

You can get away with this if you're sure what you're parsing is no more complicated than your example. However, you really shouldn't be parsing HTML (or XML) with a regex (as someone has already noted here), because XML can be arbitrarily nested, and a regex can't deal with that.
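For PHP, the same "anything except an angle bracket" idea, restricted to h1 to h3, might be sketched as follows. The function name `matchHeader` is hypothetical. One difference from the pattern in the question is worth noting: `(.+?)` requires at least one character on each side of the title, so `<h1>header title</h1>` cannot match because nothing sits between the title and `</h1>`. Using `*` instead of `+` removes that requirement.

```php
<?php
// Hypothetical helper: returns the first matching header element, or null.
function matchHeader(string $html, string $title): ?string
{
    $pattern = '/<(h[1-3])\b[^>]*>'          // opening tag, optional attributes
             . '[^<]*?'                       // optional text before the title
             . preg_quote($title, '/')        // the title, taken literally
             . '[^<]*'                        // optional text after the title
             . '<\/\1>/i';                    // closing tag of the same level
    return preg_match($pattern, $html, $m) === 1 ? $m[0] : null;
}

// The sample headers from the question.
$samples = [
    '<h1>header title</h1>',
    '<h2>site | header title</h2>',
    '<h3 class="header">header title</h3>',
    '<h2>header title 23 jan 2009</h2>',
];
foreach ($samples as $sample) {
    var_dump(matchHeader($sample, 'header title') === $sample);
}
```

Since `[^<]` also matches newlines, no `s` modifier is needed here. The trade-off is that a header with nested markup inside it is not matched, which is the usual limit of this approach.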

it's in php, and actually I only want the header tags h1, h2, h3. So it would be: — Yvo, Feb 12 '10 at 15:14
