simple html dom parser get html between elements

Question

I'm using the PHP Simple HTML DOM library to get HTML from a webpage. I need to fetch the HTML from the first node inside 'div.page-content' up to (but not including) the first 'h4' tag. Example:

<div class="page-content">
   First text
   <p>Second text</p>
   <div>Third text</div>
   <p>More text</p>
   <h4>Subtitle 1</h4>
   <p>bla bla</p>
   <p>bla bla</p>
   <h4>Subtitle 2</h4>
   <p>bla bla</p>
   <p>bla bla</p>
</div>

I've tried this:

$start = $html->find('div.page-content', 0);
while ( $next = $start->next_sibling() ) {
    if ( $next->tag == 'h4')
        break;
    else{
        echo $next->plaintext;
        echo '<br/>';
        
        $start = $next;
    }
}

But it doesn't fetch anything.

I need to fetch all of this:

 First text
 <p>Second text</p>
 <div>Third text</div>
 <p>More text</p>

If you need that `First text` string, why are you starting at `div p`? That'll explicitly skip over any text before the first paragraph tag. — Mike 'Pomax' Kamermans, Mar 20 '23 at 17:21
@Mike'Pomax'Kamermans sorry, that was a copy/paste mistake in the question. I've updated the post. — ISFT, Mar 20 '23 at 17:39
@ISFT why? Someone already wrote an answer that works, without even needing a third party library. Does that not work for you? (If so, please let them know why) — Mike 'Pomax' Kamermans, Mar 21 '23 at 16:34

score 0 · Answer 1 · answered Mar 20 '23 at 17:51

I've never used the PHP Simple HTML Dom library before, but with the native DOMDocument you can do it pretty easily:

$html = <<<EOT
<body>
<div class="page-content">
   First text
   <p>Second text</p>
   <div>Third text</div>
   <p>More text</p>
   <h4>Subtitle 1</h4>
   <p>bla bla</p>
   <p>bla bla</p>
   <h4>Subtitle 2</h4>
   <p>bla bla</p>
   <p>bla bla</p>
</div>
</body>
EOT;

$doc = new DOMDocument();
$doc->loadHTML($html);

// Get our node by class name
// See https://stackoverflow.com/a/6366390/231316
$finder = new DOMXPath($doc);
$classname = "page-content";
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");

$buf = '';
foreach ($nodes as $node) {
    foreach ($node->childNodes as $child) {
        if ($child->nodeName === 'h4') {
            break;
        }
        $buf .= $doc->saveHTML($child);
    }
}

echo $buf;

Outputs the following, which includes whitespace:

   First text
   <p>Second text</p>
   <div>Third text</div>
   <p>More text</p>

Demo: https://3v4l.org/JWUi5

Thank you so much, but I have to do it with "simple html dom" php library — ISFT, Mar 21 '23 at 08:26

score 0 · Answer 2 · answered Jul 27 '23 at 18:23

You can modify your approach to iterate over the child nodes of div.page-content and stop at the first h4 tag. Note that children() returns only element nodes, so it would skip the bare "First text"; the nodes property includes text nodes as well. Here's a revised code snippet that should work for your case:

// Assuming you have already loaded the HTML into $html using the library.

// Find the first div.page-content
$pageContent = $html->find('div.page-content', 0);

// Initialize an empty string to store the extracted HTML
$extractedHtml = '';

// Iterate through all child nodes of div.page-content.
// ->nodes includes text nodes (such as "First text"); ->children() returns elements only.
foreach ($pageContent->nodes as $child) {
    // Check if the current child is an h4 tag
    if ($child->tag === 'h4') {
        break; // Stop when we encounter the first h4 tag
    } else {
        // Append the HTML (or raw text, for text nodes) of the current child
        $extractedHtml .= $child->outertext;
    }
}

// Output the extracted HTML
echo $extractedHtml;
