Extract image elements from HTML

Question

I am trying to extract the img tags from an HTML string.

I have

    $parser = new DOMDocument;
    $parser->loadHTML($this->html);

    foreach ($parser->getElementsByTagName('img') as $imgNode) {
        echo $parser->saveHTML($imgNode);
    }

$this->html contains a large amount of HTML and JavaScript.

For example:

<div id='someid'>
<button id='bt' onclick='clickme()'>click me</button>
<img src='test.jpg'/>
.....
.....
more...

</div>

<div>
.....
.....
more...

I get a warning saying:

DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,

I am not sure how to fix this, and I don't know if there is a better way to extract all the images from a large block of HTML.

Any ideas? Thanks a lot!

score 2 · Accepted Answer · edited May 23 '17 at 10:31

I am in no way an expert on these matters (yet), but I hope this helps in some way.

According to this answer by troelskn, you can make the DOM parser more tolerant to badly formed HTML by using libxml_use_internal_errors. That might help you get rid of that warning.
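
As a rough sketch of what that function does (the sample markup and variable names below are made up for illustration): with internal errors enabled, libxml buffers parse problems instead of raising PHP warnings, and they can still be read back afterwards with libxml_get_errors(). A bare `&`, as in the sample URL, is the kind of markup that typically produces the htmlParseEntityRef message, though the exact messages depend on the installed libxml version.

```php
<?php
// Hypothetical input: note the unescaped "&" inside the URL.
$html = "<div><img src='test.jpg?a=1&b=2'/><p>more...</p></div>";

// Buffer libxml errors instead of emitting PHP warnings.
// The call returns the previous setting so it can be restored later.
$previous = libxml_use_internal_errors(true);

$document = new DOMDocument();
$loaded = $document->loadHTML($html);

// Any parse problems are still recorded and can be inspected if wanted.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// Empty the error buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$imageCount = $document->getElementsByTagName('img')->length;
echo 'Loaded: ', var_export($loaded, true), ', images found: ', $imageCount, "\n";
```

Restoring the previous setting keeps the change local, so other code that relies on libxml warnings is not affected.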

Finding all images in a document can be done with DOMXPath. Its constructor takes a DOMDocument as a parameter, and the resulting object lets you run XPath queries on that document.

// Collect parse errors internally instead of emitting warnings.
// This must be called before loadHTML().
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($your_html);

$xpath = new DOMXPath($document);

// Find all img tags.
$img_nodes = $xpath->query('//img');

DOMXPath::query returns a DOMNodeList which can be looped through using DOMNodeList::item, which returns a DOMNode.

for ($i = 0; $i < $img_nodes->length; $i++)
{
    $node = $img_nodes->item($i);
    // Manipulate the node.
}
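
Putting the steps together, a minimal self-contained sketch might look like the following. The sample markup, the `$html` variable, and the `$sources` array are illustrative assumptions, not part of the question's code. It collects each image's `src` attribute in addition to echoing the tag.

```php
<?php
// Hypothetical sample markup standing in for the question's large HTML string.
$html = <<<'HTML'
<div id='someid'>
<button id='bt' onclick='clickme()'>click me</button>
<img src='test.jpg'/>
<p>Tom & Jerry <img src='other.png' alt='second'></p>
</div>
HTML;

// Keep libxml parse problems out of PHP's warning output.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html);

// Discard whatever the parser buffered; this sketch does not inspect it.
libxml_clear_errors();

$xpath = new DOMXPath($document);

$sources = [];
foreach ($xpath->query('//img') as $node) {
    // Each match is a DOMElement; getAttribute() returns '' if src is absent.
    $sources[] = $node->getAttribute('src');

    // Serialize just this element, as in the question's code.
    echo $document->saveHTML($node), "\n";
}
```

A DOMNodeList can be iterated with foreach directly, so the index-based loop with item() is optional.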

Disclaimer: The code I posted is untested and was put together using the manual.

edited May 23 '17 at 10:31

Community

answered Feb 02 '13 at 02:55

thordarson

"you can make the DOM parser more tolerant to badly formed HTML by using libxml_use_internal_errors"--wrong! This simply silences the errors. `loadHTML()` is already tolerant of html errors, although in a nonstandard way. – Francis Avila Feb 02 '13 at 03:29
@FrancisAvila Upping the threshold of which something complains about a problem makes it more tolerant, wouldn't you say? – thordarson Feb 02 '13 at 03:35
Saying "more tolerant" implies different parsing behavior, not different error reporting. Also the errors are still collected (by libxml), just not immediately sent to PHP's error-reporting layer, so arguably it's not "more tolerant" by your standard either. – Francis Avila Feb 02 '13 at 03:40
@FrancisAvila Take [pain tolerance](http://en.wikipedia.org/wiki/Pain_tolerance) for example. Even though a person won't shout out in pain, the pain might still be there, according to the [pain threshold](http://en.wikipedia.org/wiki/Pain_threshold). So even if an individual experiences pain (read: the error is there), he might not feel the need to cry out (read: report an error) about it. Back to my answer, sure you could use that function to collect the errors later, but I'm using it to suppress them. – thordarson Feb 02 '13 at 03:45
