
I want to find all <h3> blocks in this example:

<h3>sdf</h3>
sdfsdf
<h3>sdf</h3>
32
<h2>fs</h2>
<h3>23sd</h3>
234
<h1>h1</h1>

(A block runs from one h3 to the next h3 or h2.) This regexp finds only the first h3 block:

~\<h3[^>]*\>[^>]+\<\/h3\>.+(?:\<h3|\<h2|\<h1)~is

I use the PHP function preg_match_all. (Quote from the docs: After the first match is found, the subsequent searches are continued on from end of the last match.)

What do I have to modify in my regexp?

P.S.

<h3>1</h3>
1content
<h3>2</h3>
2content
<h2>h2</h2>
<h3>3</h3>
3content
<h1>h1</h1>

This content has to be parsed as:

[0] => <h3>1</h3>1content
[1] => <h3>2</h3>2content
[2] => <h3>3</h3>3content
Andy Lester
Andrei Nikolaev
  • [Don't use regexes for parsing HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – John Conde Apr 04 '14 at 00:45
  • Not sure I really understand your issue. – jcobhams Apr 04 '14 at 00:46
  • Thanks for your answer, but I parse my own page with a defined structure. – Andrei Nikolaev Apr 04 '14 at 00:47
  • Change `.+` to `.+?` and replace the non-capturing group with a lookahead. Note that angle brackets don't need to be escaped, and neither do slashes since you use `~` as the delimiter. – Casimir et Hippolyte Apr 04 '14 at 00:50
  • Please take a look at the [DomDocument](http://www.php.net/manual/en/class.domdocument.php) class. If you parse your HTML, you can easily query all the heading three blocks. – Dave Chen Apr 04 '14 at 00:58
  • @CasimiretHippolyte `.+?` skips the second block. – Andrei Nikolaev Apr 04 '14 at 00:59
  • @AndreiNikolaev: the second block is skipped because you didn't replace the non-capturing group with a lookahead. – Casimir et Hippolyte Apr 04 '14 at 01:01
  • Questions about parsing HTML with PHP/regex come up so often on SO. Let me echo what has already been said: don't do that. There are many far more able and useful tools for this problem. Look at PHP's internal classes `DOMDocument` and `DOMXPath` for starters. Make life easier for yourself :) – Darragh Enright Apr 04 '14 at 01:09
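The two comments from Casimir et Hippolyte can be illustrated with a short sketch. It runs three patterns over the sample from the question: the original greedy one, a lazy one whose group still consumes the next heading tag, and a lazy one with a lookahead. The `$html` string and the variable names are assumptions made for this example.

```php
<?php
// Sample input copied from the question (Unix line endings assumed).
$html = "<h3>1</h3>\n1content\n<h3>2</h3>\n2content\n<h2>h2</h2>\n<h3>3</h3>\n3content\n<h1>h1</h1>\n";

// 1. Original pattern: with the s flag, greedy .+ runs to the LAST heading.
preg_match_all('~<h3[^>]*>[^>]+</h3>.+(?:<h3|<h2|<h1)~is', $html, $greedy);

// 2. Lazy .+? but the group still consumes the next "<h3", so the
//    following search starts after it and the second block is skipped.
preg_match_all('~<h3[^>]*>[^>]+</h3>.+?(?:<h3|<h2|<h1)~is', $html, $consuming);

// 3. Lazy .+? with a lookahead: the next heading is checked, not consumed.
preg_match_all('~<h3[^>]*>[^>]+</h3>.+?(?=<h[123])~is', $html, $lookahead);

printf(
    "greedy: %d, consuming: %d, lookahead: %d\n",
    count($greedy[0]),
    count($consuming[0]),
    count($lookahead[0])
);
// prints "greedy: 1, consuming: 2, lookahead: 3"
```

Only the third variant leaves the next `<h3` in place for the following search, which is why it finds all three blocks.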

3 Answers


You shouldn't use Regex to parse HTML if there is any nesting involved.

Regex

(<(h\d)>.*?<\/\2>)[\r\n]([^\r\n<]+)

Replacement

\1\3
or
$1$3

http://regex101.com/r/uQ3uC2
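As a rough sketch of how this pattern could be applied from PHP: the snippet below assumes Unix line endings and a hypothetical `$html` string holding the sample from the question. Instead of a replacement it uses `preg_match_all` and joins groups 1 and 3, which corresponds to `$1$3`. The `h3` check inside the loop is an addition for this example, since the pattern itself accepts any `h\d` heading that is followed by a text line.

```php
<?php
// Sample input copied from the question (hypothetical variable name).
$html = "<h3>1</h3>\n1content\n<h3>2</h3>\n2content\n<h2>h2</h2>\n<h3>3</h3>\n3content\n<h1>h1</h1>\n";

// Group 1: the whole heading element, group 2: the tag name,
// group 3: the text line that follows the heading.
$pattern = '~(<(h\d)>.*?</\2>)[\r\n]([^\r\n<]+)~';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$blocks = array();
foreach ($matches as $m) {
    if (strtolower($m[2]) === 'h3') {   // keep only h3 blocks
        $blocks[] = $m[1] . $m[3];      // same as the $1$3 replacement
    }
}

print_r($blocks);
```

Because `[\r\n]` matches a single character, input with `\r\n` line endings would need that part of the pattern adjusted.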

Vasili Syrakis

With DOMDocument:

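$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about imperfect markup

// Top-level children of <body>: the headings and the content between them
$nodes = $dom->getElementsByTagName('body')->item(0)->childNodes;

$flag = false;      // true while we are inside an <h3> block
$tmp = '';          // markup collected for the current block
$results = array();

foreach ($nodes as $node) {
    if ($node->nodeType == XML_ELEMENT_NODE &&
        preg_match('~^h(?:[12]|(3))$~i', $node->nodeName, $m)) {
        // Any h1/h2/h3 closes the block in progress
        if ($flag) {
            $results[] = $tmp;
        }
        // Group 1 is set only for h3, which opens a new block
        $flag = isset($m[1]);
        if ($flag) {
            $tmp = $dom->saveXML($node);
        }
    } elseif ($flag) {
        // Anything else is appended to the current block
        $tmp .= $dom->saveXML($node);
    }
}

// Keep the last block if the document ends before another heading
if ($flag) {
    $results[] = $tmp;
}

echo htmlspecialchars(print_r($results, true));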

With regex:

preg_match_all('~<h3.*?(?=<h[123])~si', $html, $matches);

echo htmlspecialchars(print_r($matches[0], true));
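The matches from this pattern still contain the line breaks between each heading and its content. A minimal sketch of one way to get the exact strings listed in the question, assuming the sample input below and the hypothetical variable `$blocks`:

```php
<?php
// Sample input copied from the question (Unix line endings assumed).
$html = "<h3>1</h3>\n1content\n<h3>2</h3>\n2content\n<h2>h2</h2>\n<h3>3</h3>\n3content\n<h1>h1</h1>\n";

preg_match_all('~<h3.*?(?=<h[123])~si', $html, $matches);

// Strip the line breaks inside each block: "<h3>1</h3>\n1content\n"
// becomes "<h3>1</h3>1content".
$blocks = array_map(function ($block) {
    return str_replace(array("\r", "\n"), '', $block);
}, $matches[0]);

print_r($blocks);
```

The lookahead requires another heading after each block. If the last `<h3>` block may be the end of the input, extending the lookahead to `(?=<h[123]|\z)` is one way to still catch it.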
Casimir et Hippolyte
preg_match_all('/<h3>(.*?)<\/h3>/is', $stringHTML, $matches);
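For comparison, a small sketch of what this pattern returns on the sample from the question (the `$stringHTML` value below is an assumption). The capture group holds only the text inside each `<h3>` element, so the content that follows each heading would still have to be collected separately.

```php
<?php
// Sample input copied from the question (assumed value for $stringHTML).
$stringHTML = "<h3>1</h3>\n1content\n<h3>2</h3>\n2content\n<h2>h2</h2>\n<h3>3</h3>\n3content\n<h1>h1</h1>\n";

preg_match_all('/<h3>(.*?)<\/h3>/is', $stringHTML, $matches);

print_r($matches[0]); // full matches: the <h3> elements themselves
print_r($matches[1]); // group 1: the heading text only ("1", "2", "3")
```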
Minh Nguyen