
I want to match any of these cases with a regex. I have the header text, but I need to match it with the (possible) corresponding HTML:

<h1>header title</h1>
<h2>site | header title</h2>
<h3 class="header">header title</h3>
<h2>header title 23 jan 2009</h2>
<h1>header title</h1>

I have this:

/(<(h1|h2|h3))(.+?)".$title."(.+?)(<\/\\2>)/i

But it doesn't always work, and I don't see why.

Thanks

Gumbo
Yvo
  • You'd better give up on regexes to parse HTML. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags (for example) – Manrico Corazzi Feb 12 '10 at 14:52
  • What language? .NET? Java? JavaScript? Perl? Different languages have different regex syntaxes, so we need to know. – Oded Feb 12 '10 at 14:52
  • It's PHP's preg_match. I don't want to walk through the DOM since that would cause too much load (tried that). – Yvo Feb 12 '10 at 15:12
  • I seem to be getting better results with a small tweak: /(<(h1|h2|h3))(.+)?".$title."(.+)?(<\/\\2>)/i God knows why :p – Yvo Feb 12 '10 at 15:32
  • Here's the direct link to the famed answer Manrico is referencing: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Hank Gay Feb 12 '10 at 16:51

3 Answers


Don't use regexes to parse HTML! Use an HTML parser, instead.

Hank Gay

Is $title regex-escaped (so characters like {, [ etc. are escaped)?
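To illustrate the escaping point, here is a minimal sketch in Python (chosen only for the demonstration; in PHP the equivalent step would be `preg_quote($title, '/')`). The title and HTML string are hypothetical test data, not taken from the question.

```python
import re

# Hypothetical title that contains regex metacharacters.
title = "header title [draft]"
html = "<h1>header title [draft]</h1>"

# Unescaped: "[draft]" is read as a character class, so the literal text is not found.
unescaped = re.search(title, html)

# Escaped: every metacharacter in the title is matched literally.
escaped = re.search(re.escape(title), html)

print(unescaped)         # no match object
print(escaped.group(0))  # the literal title
```

If the title can come from arbitrary page content, escaping it before building the pattern is probably the first thing to check.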

Line endings may be a problem too: if the header spans several lines, you need something like multiline support (in PCRE, the `s` modifier makes `.` match newlines), if your regex implementation supports it.
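A small sketch of the line-ending problem, again in Python for illustration only. `re.DOTALL` plays the role of PCRE's `s` modifier here; the pattern and the sample HTML are assumptions made up for this example.

```python
import re

# Hypothetical header whose text sits on its own line.
html = "<h2>\n  header title\n</h2>"
pattern = r"<(h[1-3])\b[^>]*>.*?header title.*?</\1>"

# By default "." does not match a newline, so the header is missed.
without_dotall = re.search(pattern, html, re.IGNORECASE)

# With DOTALL, "." also matches newlines and the header is found.
with_dotall = re.search(pattern, html, re.IGNORECASE | re.DOTALL)

print(without_dotall)
print(with_dotall.group(0))
```

Note that with dot-all enabled, `.*?` can also run across other tags until it finds the title, so this mode makes false positives more likely on real pages.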

It is better to process structured data with appropriate tools - XML with XML parser, HTML with HTML parser. There are parsers like BeautifulSoup in Python, hpricot in Ruby, libxml2...
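As a rough sketch of the parser approach, the following uses Python's standard-library `html.parser` instead of the third-party parsers named above (PHP ships a comparable `DOMDocument` class). The class name `HeaderFinder` and the helper `find_headers` are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class HeaderFinder(HTMLParser):
    """Collects (tag, text) for every h1-h3 whose text contains the title."""

    def __init__(self, title):
        super().__init__()
        self.title = title.lower()
        self.current = None  # header tag currently open, if any
        self.buffer = []     # text pieces seen inside the open header
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            text = "".join(self.buffer).strip()
            # Case-insensitive containment check against the known title.
            if self.title in text.lower():
                self.matches.append((tag, text))
            self.current = None


def find_headers(html, title):
    finder = HeaderFinder(title)
    finder.feed(html)
    finder.close()
    return finder.matches
```

This is an event-based parser, so it does not build a full DOM tree, which may help with the load concern raised in the comments.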

Messa

What you (logically) want for your example is something like:

<(group of anything not including ">")> (value to extract) <(group of anything not including ">")>

e.g.

<[^>]+>([^<]+)<[^>]+>

The specific regex syntax depends somewhat on the environment you're working in.

You can get away with this if you're sure what you're parsing is no more complicated than your example. However, you really shouldn't be parsing HTML (or XML) with a regex (as someone has already noted here), because XML can be arbitrarily nested and a regex can't deal with that.
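For the simple, non-nested case, a sketch restricted to h1 to h3 might look like the following (Python for illustration; the same pattern should carry over to `preg_match` with `preg_quote` applied to the title). The function name `header_regex` is hypothetical. One likely reason the question's pattern fails on `<h1>header title</h1>` is that `(.+?)` after the title requires at least one character before the closing tag; `[^<]*` below allows zero.

```python
import re


def header_regex(title):
    """Build a pattern for an h1-h3 element whose text contains the title."""
    return re.compile(
        r"<(h[1-3])\b[^>]*>"  # opening h1/h2/h3, optional attributes
        r"[^<]*" + re.escape(title) + r"[^<]*"  # title with optional text around it
        r"</\1>",  # closing tag must match the opening tag name
        re.IGNORECASE,
    )


samples = [
    "<h1>header title</h1>",
    "<h2>site | header title</h2>",
    '<h3 class="header">header title</h3>',
    "<h2>header title 23 jan 2009</h2>",
]
pattern = header_regex("header title")
for s in samples:
    print(bool(pattern.search(s)), s)
```

Because `[^<]*` stops at the next `<`, this will not match headers that contain nested markup such as `<h1><a href="...">header title</a></h1>`, which is where a parser becomes the safer choice.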

Steve B.
  • It's in PHP, and actually I only want the header tags h1, h2, h3. So it would be:

    – Yvo Feb 12 '10 at 15:14