I'm having a hard time figuring out how to extract multiple headings together with the first paragraph that follows each one. In this case I only need the h3 headings and the paragraph directly after each of them.

Example code

function everything_in_tags($string, $tagname)
{
    $pattern = "#<\s*?$tagname\b[^>]*>(.*?)</$tagname\b[^>]*>#s";
    preg_match($pattern, $string, $matches);
    return $matches[1];
}
$tagname = "h3";

$string = "<h1>This is my title</h1>

<p>This is a text right under my h1 title.</p>
<p>This is some more text under my h1 title</p>

<h2>This is my level 2 heading</h2>
<p>This is text right under my level 2 heading</p>

<h3>First h3</h3>
<p>First paragraph for the first h3</p>

<h3>Second h3</h3>
<p>First paragraph for the second h3</p>

<h3>Third h3</h3>
<p>First paragraph for the third h3</p>
<p>Second paragraph for the third h3</p>

<h2>This is my level 2 heading</h2>
<p>This is text right under my level 2 heading</p>";

//OUTPUT: First h3
echo everything_in_tags($string, $tagname);

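I would like to loop over the results with foreach, but that requires the extraction above to return every match first.

// Sketch of the loop I have in mind; $headings and $paragraphs are the
// arrays I would like to end up with, indexed in the same order.
foreach ($headings as $i => $heading) {
    echo "<h3>".$heading."</h3>";
    echo "<p>".$paragraphs[$i]."</p>";
}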

//Expected output:
//<h3>First h3</h3>
//<p>First paragraph for the first h3</p>

//<h3>Second h3</h3>
//<p>First paragraph for the second h3</p>

//<h3>Third h3</h3>
//<p>First paragraph for the third h3</p>

In the example above I can get the first h3, but after a lot of reading I still can't work out how to get all of the h3 elements and the first paragraph after each one.

If anyone can point me in the right direction and explain to me how to do this I would really appreciate it. Thank you.

Niels Hermann

1 Answer

There is an obligatory de facto answer to this: don't use regular expressions to parse HTML. There are exceptions for tightly controlled HTML, or for cases where mistakes don't really matter, but in general I agree with that advice. Instead, I'd point you at a DOM-aware API, where you can express concepts such as HTML tags and "the next element" directly.

Here's a sample that works, although you'll probably need to tweak where I'm dumping.

<?php

$html = <<<TAG
<h1>This is my title</h1>

<p>This is a text right under my h1 title.</p>
<p>This is some more text under my h1 title</p>

<h2>This is my level 2 heading</h2>
<p>This is text right under my level 2 heading</p>

<h3>First h3</h3>
<p>First paragraph for the first h3</p>

<h3>Second h3</h3>
<p>First paragraph for the second h3</p>

<h3>Third h3</h3>
<p>First paragraph for the third h3</p>
<p>Second paragraph for the third h3</p>

<h2>This is my level 2 heading</h2>
<p>This is text right under my level 2 heading</p>
TAG;


$dom = new DOMDocument();
// Load the HTML, don't worry about it being a fragment
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

// Grab all H3 tags. This might need to be adjusted if there's more to the depth
$results = $xpath->query("//h3");
foreach ($results as $result) {
    var_dump(sprintf('<h3>%1$s</h3>', $result->textContent));
    
    // See if the next element is a P tag (nextElementSibling needs PHP 8.0+)
    $next = $result->nextElementSibling;
    if ($next && 'p' === $next->nodeName) {
        var_dump(sprintf('<p>%1$s</p>', $next->textContent));
    }
}

Output:

string(17) "<h3>First h3</h3>"
string(39) "<p>First paragraph for the first h3</p>"
string(18) "<h3>Second h3</h3>"
string(40) "<p>First paragraph for the second h3</p>"
string(17) "<h3>Third h3</h3>"
string(39) "<p>First paragraph for the third h3</p>"

Demo here: https://3v4l.org/gvBrv
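
If you would rather loop over the results than dump them, the same idea can be wrapped in a function that returns heading/paragraph pairs. This is only a minimal sketch: the function name `headings_with_first_paragraph` and the array keys are made up for illustration, it assumes PHP 8.0 or newer (for `nextElementSibling`), and it assumes `$tagname` is a trusted tag name because it is concatenated into the XPath query.

```php
<?php

// Hypothetical helper (not part of the sample above): collects each heading
// and the paragraph that directly follows it into an array of pairs.
function headings_with_first_paragraph(string $html, string $tagname = 'h3'): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Default flags: libxml wraps the fragment in <html><body>, which does
    // not affect the //h3 query below.
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $pairs = [];
    foreach ($xpath->query('//' . $tagname) as $heading) {
        $next = $heading->nextElementSibling;
        $pairs[] = [
            'heading'   => $heading->textContent,
            // null when the heading is not directly followed by a <p>
            'paragraph' => ($next && 'p' === $next->nodeName) ? $next->textContent : null,
        ];
    }

    return $pairs;
}

$html = '<h3>First h3</h3><p>First paragraph for the first h3</p>'
      . '<h3>Second h3</h3><p>First paragraph for the second h3</p><p>More text</p>'
      . '<h2>Another heading</h2>';

foreach (headings_with_first_paragraph($html) as $pair) {
    echo '<h3>' . htmlspecialchars($pair['heading']) . '</h3>' . "\n";
    if (null !== $pair['paragraph']) {
        echo '<p>' . htmlspecialchars($pair['paragraph']) . '</p>' . "\n";
    }
}
```

Returning plain arrays keeps the DOM objects out of the template code, so the loop only deals with strings.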

Chris Haas
  • Thank you very much Chris - works like a charm. I didn't know about this. Performance wise - will this be good as well? I have tried to run it like 10.000 times with microtime and it completes on average in 0.0001s. So that's fast - locally. But if I have 5000 pageviews per hour - will this run as smoothly - in your opinion? – Niels Hermann Dec 21 '21 at 10:44
  • That averages out to about 1.39 requests per second, or probably 2 to 5 at peak, and I would be surprised if you noticed this. If you have a large DOM, dozens or hundreds of MB, it might be felt. – Chris Haas Dec 21 '21 at 13:19
  • Hello again @Chris. I have a question. The code above worked perfectly on my localhost, but on my web server I get this notice: `Notice: Undefined property: DOMElement::$nextElementSibling`. So it works for the h3 elements and displays all of them, but the following paragraph does not show. Do you know how to fix this? Thank you. – Niels Hermann Jan 14 '22 at 10:47
  • Inspect the `nodeType` property on `$result`, if it is `1`, you have an Element as expected, but if it is [something else](https://www.php.net/manual/en/dom.constants.php) you could be on a CDATA, Text or something else. – Chris Haas Jan 14 '22 at 12:38
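
A note on the notice reported in the comments: the `nextElementSibling` property was only added to `DOMElement` in PHP 8.0, so the most likely explanation is that the web server runs an older PHP version than the local machine. Assuming that is the cause, one possible workaround is to walk `nextSibling` and skip anything that is not an element. The helper name `next_element_sibling` below is made up for illustration; treat it as a sketch rather than a drop-in fix.

```php
<?php

/**
 * Hypothetical helper: emulates nextElementSibling on PHP versions that
 * lack it by skipping text, comment and other non-element nodes.
 *
 * @return DOMNode|null The next element sibling, or null if there is none.
 */
function next_element_sibling(DOMNode $node)
{
    $next = $node->nextSibling;
    while ($next !== null && $next->nodeType !== XML_ELEMENT_NODE) {
        $next = $next->nextSibling;
    }
    return $next;
}

$dom = new DOMDocument();
$dom->loadHTML("<h3>First h3</h3>\n<p>First paragraph for the first h3</p>");

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//h3') as $h3) {
    // Same check as with nextElementSibling, but using the helper above.
    $next = next_element_sibling($h3);
    if ($next && 'p' === $next->nodeName) {
        echo $next->textContent, "\n";
    }
}
```

Checking `nodeType` against `XML_ELEMENT_NODE` (which is `1`) is what lets the loop step over the whitespace text nodes that can sit between a heading and its paragraph.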