I am looking for a regex to find the contents of the first <h3> tag. What can I use there?

- Using regexes for this kind of HTML parsing is generally a bad idea. See: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags Use a proper HTML parser. – FrustratedWithFormsDesigner Oct 04 '10 at 14:12
- Don't use a regex for HTML/XML parsing. Use an HTML/XML parser. PHP has a few. – sberry Oct 04 '10 at 14:12
- You can also use XPath for this purpose if your HTML is XHTML. – Skarab Oct 04 '10 at 14:12
- PHP has the ability to parse HTML DOMs natively - you almost certainly want to use that instead of regex. – Peter Boughton Oct 04 '10 at 14:13
- Thou shalt not use regular expressions to parse HTML. See e.g. http://stackoverflow.com/questions/3650125/how-to-parse-html-with-php-closed – Pekka Oct 04 '10 at 14:13
- I don't know why this is getting down-voted - it's a legitimate question for a newb. – FrustratedWithFormsDesigner Oct 04 '10 at 14:17
- I agree with the sentiments of avoiding using regex for this, but I think all the downvotes are a bit harsh -- isn't this supposed to be a site where you ask questions because you don't know how to do something? – Spudley Oct 04 '10 at 14:17
- possible duplicate of [getting all values from h1 tags using php](http://stackoverflow.com/questions/3299033/getting-all-values-from-h1-tags-using-php) – Gordon Oct 04 '10 at 14:18
- *(related)* [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Oct 04 '10 at 14:21
8 Answers
You should use PHP's DOM parser instead of regular expressions. You're looking for something like this (untested code warning):
$domd = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$domd->loadHTML($html_content);
libxml_clear_errors();
libxml_use_internal_errors(false);
$domx = new DOMXPath($domd);
// (//h3)[1] selects the first h3 in document order
$items = $domx->query("(//h3)[1]");
if ($items->length > 0) {
    echo $items->item(0)->textContent;
}

Well, a simple solution would be the following:
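// The s modifier lets "." match newlines, so headings spanning several lines are found too.
if ( preg_match( '#<h3[^>]*>(.*?)</h3>#is', $text, $match ) ) {
    echo $match[1];
}
For anything more complex, you should consider using an HTML document parser though.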
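Whether `.` is allowed to cross line breaks changes which heading a pattern like this returns. The sketch below compares the pattern with and without the `s` modifier; the sample string in `$text` is made up for illustration.

```php
<?php
// Hypothetical sample input: the text of the first heading spans two lines.
$text = "<h3 class=\"title\">First\nheading</h3><h3>Second</h3>";

// Without s, "." stops at the newline, so the first h3 cannot match
// and the search silently moves on to the second one.
$found = preg_match('#<h3[^>]*>(.*?)</h3>#i', $text, $match);
$withoutS = $found ? $match[1] : null;

// With s, "." also matches newlines and the first h3 is captured.
$found = preg_match('#<h3[^>]*>(.*?)</h3>#is', $text, $match);
$withS = $found ? $match[1] : null;

var_dump($withoutS); // string(6) "Second"
var_dump($withS);    // the two-line text "First\nheading"
```

Note that the version without `s` does not fail loudly; it just returns a different heading.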

The DOM approach:
<?php
$html = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title></title>
</head>
<body>
<h1>Lorem ipsum</h1>
<h2>Dolor sit amet</h2>
<h3>Duis quis velit est</h3>
<p>Cras non tempor est.</p>
<p>Maecenas nec libero leo.</p>
<h3>Nulla eu ligula est</h3>
<p>Suspendisse potenti.</p>
</body>
</html>
';
$doc = new DOMDocument;
$doc->loadHTML($html);
$titles = $doc->getElementsByTagName('h3');
if( !is_null($titles->item(0)) ){
echo $titles->item(0)->nodeValue;
}
?>

Here's an explanation of why parsing HTML with regular expressions is evil. Anyway, this is a way to do it...
$doc = new DOMDocument();
$doc->loadHTML($text);
$headings = $doc->getElementsByTagName('h3');
$heading = $headings->item(0);
$heading_value = (isset($heading->nodeValue)) ? $heading->nodeValue : 'Header not found';

First of all: regular expressions aren't a proper tool for parsing HTML code. However, in this case they should be good enough, because H3 tags cannot be nested.
preg_match_all('/<h3[^>]*>(.*?)<\/h3>/si', $source, $matches);
The $matches variable should then contain the contents of the H3 tags (the captured text is in $matches[1]).

- But they can be commented out, or contain the code `Wibble > Wobble`, or similar. – Peter Boughton Oct 04 '10 at 14:16
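The commented-out case from this comment can be reproduced in a few lines. This is only a sketch with made-up sample markup in `$source`: the regex reports the heading inside the HTML comment, while `DOMDocument` skips it because comments are not elements.

```php
<?php
// Hypothetical input: the first <h3> is commented out.
$source = '<html><body><!-- <h3>Old heading</h3> --><h3>Real heading</h3></body></html>';

// The regex has no notion of HTML comments, so it matches inside one.
preg_match_all('/<h3[^>]*>(.*?)<\/h3>/si', $source, $matches);
$regexFirst = $matches[1][0];

// The DOM parser turns the comment into a comment node, not an h3 element.
$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc->loadHTML($source);
libxml_clear_errors();
$node = $doc->getElementsByTagName('h3')->item(0);
$domFirst = $node !== null ? $node->textContent : null;

var_dump($regexFirst); // string(11) "Old heading"
var_dump($domFirst);   // string(12) "Real heading"
```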
Use an XPath expression like
"/html/body/h3[1]"
This will select the whole first h3 node that is a direct child of body. XPath positions start at 1, so an index of [0] matches nothing.
Note that this will not work on ill-formed HTML.
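XPath positions are 1-based, and an absolute path such as `/html/body/h3` only looks at direct children of `body`. The sketch below shows both points with PHP's `DOMXPath`; the markup in `$html` is a made-up sample.

```php
<?php
// Hypothetical sample: one h3 nested in a div, one directly under body.
$html = '<html><body><div><h3>Nested first</h3></div><h3>Top-level</h3></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Position 0 never exists, so this node list is empty.
$zero = $xpath->query('/html/body/h3[0]');

// First h3 that is a direct child of body.
$child = $xpath->query('/html/body/h3[1]');

// First h3 anywhere in the document, in document order.
$any = $xpath->query('(//h3)[1]');

var_dump($zero->length);                 // int(0)
var_dump($child->item(0)->textContent);  // string(9) "Top-level"
var_dump($any->item(0)->textContent);    // string(12) "Nested first"
```

Which of the two non-empty queries is the right one depends on whether "first h3" means first in the document or first at the top level of the body.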

- With DOM's loadHTML(), this will work fine with real world (read broken) HTML. – Gordon Oct 04 '10 at 14:40
PHP has the ability to parse HTML DOMs natively - you almost certainly want to use that instead of regex.
See this page for details: http://php.net/manual/en/book.dom.php
And check the related questions down the right hand side for people asking very similar questions.

preg_match("/<h3>(.*)<\/h3>/", $search_in_this_string, $put_matches_in_this_var);

- Expression here is incorrect (and using regex in general a bad idea) – Peter Boughton Oct 04 '10 at 14:16