How to grab some nodes out of an HTML string correctly?

Question

I am trying to grab some nodes out of a given HTML string:

$html = <<<'HTML'
<h1>Details au&szlig;en</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Au&szlig;enseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
HTML;

I need the first h1 and the last img node inside the string.

To do so, I used DOMDocument, because with preg_match_all or something similar we could miss something.

Complete Code:

$html = <<<'HTML'
<h1>Details au&szlig;en</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Au&szlig;enseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
HTML;

$dom = new \DOMDocument();
// since the libxml was designed for ISO-8859-1, this is a backwards hack
// @see https://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters/11310258
$dom->loadHTML(iconv('UTF-8', 'ISO-8859-1', $html),
    \LIBXML_HTML_NOIMPLIED
);
$h1List = $dom->getElementsByTagName('h1');
$h1 = $h1List->item(0);
$imgList = $dom->getElementsByTagName('img');
$img = $imgList->item($imgList->length - 1);

$data = array(
    'tabTitle' => trim($dom->saveHTML($h1)),
    'tabImg' => trim($dom->saveHTML($img))
);


// remove both wrappers if empty
$imgWrapper = $img->parentNode;
$imgWrapper->removeChild($img);

if (!$imgWrapper->hasChildNodes()) {
    $imgWrapper->parentNode->removeChild($imgWrapper);
}

$h1Wrapper = $h1->parentNode;
$h1Wrapper->removeChild($h1);

if (!$h1Wrapper->hasChildNodes()) {
    $h1Wrapper->parentNode->removeChild($h1Wrapper);
}

$data['content'] = $dom->saveHTML();

var_dump($data);

Expected output:

array(
    'tabTitle' => '<h1>Details außen</h1>',
    'tabImg' => '<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path=\'media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg\'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg">',
    'content' => '
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Au&szlig;enseite [...]</p>
<p class="own-branding">[...]</p>
<p>
'
);

But I got the following output:

array(3) {
  'tabTitle' =>
  string(501) "<h1>Details außen<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Außenseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="%7Bmedia%20path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'%7D" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
</h1>"
  'tabImg' =>
  string(373) "<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="%7Bmedia%20path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'%7D" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg">"
  'content' =>
  string(108) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">

"
}

What's wrong here? I am using PHP 5.6. Changing to PHP 7 would be possible, if the issue is related to the PHP version.

I have never heard of that rule, and in my opinion it doesn't make sense. Just imagine a site with an index: the first-order headline is the main point, and you get subpoints to it using h2 etc. Anyway, I googled this topic. Basically, yes, we SHOULD not. But this is not a functional break. - alpham8, May 24 '17 at 14:06

score 0 · Answer 1 · answered May 24 '17 at 14:23

This should get you started. First I query the DOMDocument using XPath, and then I use saveXML to print each node.

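$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// item() is used instead of [0] so this also runs on older PHP 5.6 releases
$nodes = array();
$nodes[] = $xpath->query('//h1')->item(0);            // first <h1>
$imgList = $xpath->query('//img');
$nodes[] = $imgList->item($imgList->length - 1);      // last <img>, as the question asks

foreach ($nodes as $node) {
    // utf8_decode converts the serialized node to ISO-8859-1 for display
    echo utf8_decode($dom->saveXML($node)) . PHP_EOL;
}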

This is the output for your example:

<h1>Details außen</h1>
<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"/>

You can then format this into the desired output.
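
To go from there to the array the question expects, here is a minimal sketch under a few assumptions. The function name extractTabParts() is hypothetical, and the sample string is a shortened stand-in for the question's HTML (the id, src and alt values are made up for brevity). It also assumes non-ASCII characters arrive as entities such as &szlig;, as in the sample; with raw UTF-8 input an encoding workaround like the question's iconv call would still be needed. Judging by the question's output, the first h1 wrapping everything else looks like an effect of LIBXML_HTML_NOIMPLIED on a fragment with several top-level elements, so the sketch loads without that flag and serializes only the children of the implied body.

```php
<?php
// Sketch only: extractTabParts() is a hypothetical helper, not part of the thread.
function extractTabParts($html)
{
    $dom = new DOMDocument();

    // No LIBXML_HTML_NOIMPLIED here: libxml adds <html><body> itself, so the
    // first <h1> cannot become the root element that wraps every other node.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $h1  = $xpath->query('(//h1)[1]')->item(0);        // first <h1> in document order
    $img = $xpath->query('(//img)[last()]')->item(0);  // last <img> in document order

    // Serialize both nodes before they are detached from the document.
    $data = array(
        'tabTitle' => $h1 ? $dom->saveXML($h1) : null,
        'tabImg'   => $img ? $dom->saveXML($img) : null,
    );

    // Detach both nodes and drop a wrapper (other than <body>) left without children.
    foreach (array($h1, $img) as $node) {
        if (!$node) {
            continue;
        }
        $parent = $node->parentNode;
        $parent->removeChild($node);
        if (!$parent->hasChildNodes() && $parent->nodeName !== 'body') {
            $parent->parentNode->removeChild($parent);
        }
    }

    // Serialize what is left inside <body>, child by child, without the
    // doctype/html/body scaffolding.
    $content = '';
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body) {
        foreach ($body->childNodes as $child) {
            $content .= $dom->saveXML($child);
        }
    }
    $data['content'] = trim($content);

    return $data;
}

// Shortened, hypothetical stand-in for the HTML from the question.
$html = <<<'HTML'
<h1>Details au&szlig;en</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Au&szlig;enseite [...]</p>
<p><img id="a" src="{media path='media/image/x.jpg'}" alt="x"></p>
HTML;

$data = extractTabParts($html);
var_dump($data);
```

Like the snippet above, this serializes with saveXML, so void elements come out XML-style (for example with a trailing slash on the img tag). Swapping in saveHTML($node) should give HTML-style tags, but going by the question's output it also percent-encodes the {media path=...} placeholder in src, so which one fits depends on what consumes the result.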
