HTML DOMNodelist?

Question

I tried using the following code on an HTML page, but it doesn't work. How do I retrieve and manipulate all the HTML elements output on a page?

$doc = new DOMDocument;
$doc->load('http://localhost/foo/index.php');

$items = $doc->getElementsByTagName('img');

foreach ($items as $item) {
    echo $item->nodeValue . "\n";
}

EDIT:

$dom = new DOMDocument;
$html = 'http://localhost/foo/index.php';
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
    echo $dom->saveHtml($node), PHP_EOL;

}

The code above outputs nothing

Debugging Code:

<?php

$dom = new DOMDocument;
$html = 'http://localhost/foo/index.php';

var_dump($dom->loadHTML($html));

echo '<br />';

var_dump($dom);

echo '<br />';

var_dump($dom->saveHTML());

echo '<br />';

var_dump($dom->getElementsByTagName('a'));

echo '<br />';

foreach ($dom->getElementsByTagName('a') as $node) {

    var_dump($node);

    echo '<br />';

    var_dump( $dom->saveHtml($node) );
    echo '<br />';

}

?>

Debugging Result:

bool(true)
object(DOMDocument)#1 (0) { }
string(170) "

http://localhost/foo/index.php
"
object(DOMNodeList)#2 (0) { }

What are you trying to manipulate/output? `img` elements are empty elements, hence they don't have a nodeValue. — Gordon, Jun 16 '12 at 10:17
All elements or image elements? And what do you want to manipulate? Changing the tagname? Removing an attribute? Changing the URL of the image? — hakre, Jun 16 '12 at 10:35
I originally only wanted to count the img tags in the page, but if you told me how to remove the attrs, change url, etc. That'd be really nice. — siaooo, Jun 16 '12 at 10:51
@MarcoLeonardoYamin all of that has been asked and answered before. See http://stackoverflow.com/questions/3820666/grabbing-the-href-attribute-of-an-a-element/3820783#3820783 and [my other DOM answers](http://stackoverflow.com/search?q=user%3A208809+DOM) for a start. — Gordon, Jun 16 '12 at 11:05
$dom = new DOMDocument; $html = 'http://localhost/foo/index.php'; $dom->loadHTML($html); foreach ($dom->getElementsByTagName('a') as $node) { echo $dom->saveHtml($node), PHP_EOL; } The code above outputs nothing — siaooo, Jun 16 '12 at 11:18
"Changing" URLs (which involves resolving to base-paths) has been discussed here: [problem with adding root path using php domdocument](http://stackoverflow.com/questions/7442292/problem-with-adding-root-path-using-php-domdocument) — hakre, Jun 16 '12 at 11:19
@MarcoLeonardoYamin: You know how you can do basic debugging? Take a look at the [`var_dump` function](http://php.net/var_dump). — hakre, Jun 16 '12 at 11:19
And in case you use a function you're not fluent with, just re-read its manual page: http://php.net/manual/en/domdocument.loadhtml.php - *(tip)* Most manual pages have a nice "See Also" section at the end. — hakre, Jun 16 '12 at 11:24
Well, start with the first variable you use and then continue on until you find the cause of your problem. That's called debugging. You need to do that, because we can't do that "by question and answer", you need to do that on your own. — hakre, Jun 16 '12 at 11:47
@MarcoLeonardoYamin: Psst, try `var_dump($dom->saveHTML());` after the `var_dump($dom);` line as well. ;) — hakre, Jun 16 '12 at 12:02
As you can see, you did not load the document from a URL *but* you created a document that is the text of the URL. — hakre, Jun 16 '12 at 12:08
@MarcoLeonardoYamin: There is no `<a>` element in the document at all. Your document is just this: `<p>http://localhost/foo/index.php</p>` - no `A` element at all. — hakre, Jun 16 '12 at 12:22
Maybe your question is: "How do I load a URL like `http://localhost/foo/index.php` as an HTML DOMDocument?" — hakre, Jun 16 '12 at 12:23
Read the manual page of [`DOMDocument::loadHTML`](http://php.net/DOMDocument.loadHTML) again. It tells you it loads a string, not the URL. Then scroll down to the **See Also** part and pick the function that does what you want to do. — hakre, Jun 16 '12 at 12:36
@hakre Thank you so much for your help, I don't know how to repay you since all you did was just commenting. So, I'll take your answer as the answer then. Thanks again. — siaooo, Jun 16 '12 at 12:53
@MarcoLeonardoYamin: You're welcome. I suggest you improve your debugging skills a bit and then you'll become a master of DOMDocument. Just take a little care with the details and you're fine. — hakre, Jun 16 '12 at 13:24

score 3 · Accepted Answer · answered Jun 16 '12 at 12:20

Some DOMDocument debugging hints.

If possible, upgrade to the latest PHP 5.4, because its var_dump gives you more information about DOMDocument and related classes.

I'll take your small example and add some hints on how to debug it:

$dom = new DOMDocument;
$html = 'http://localhost/foo/index.php';
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
    echo $dom->saveHtml($node), PHP_EOL;
}

Did the loading work? That is this line:

$dom->loadHTML($html);

You can look inside the document by outputting its content. If you display that in the browser, you need to view the page source, or you can escape the output with htmlspecialchars:

var_dump(htmlspecialchars($dom->saveHTML()));

This shows the document as loaded, serialized as HTML, verbatim in your browser.

The next part you might want to debug is the result of getElementsByTagName:

foreach ($dom->getElementsByTagName('a') as $node) {

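First assign the result to a variable, then check that it is not NULL or FALSE and look at its length. Note that length is a property, not a method, and that a DOMNodeList cannot be passed to htmlspecialchars:

$aTags = $dom->getElementsByTagName('a');
var_dump($aTags, $aTags->length);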

The length will tell you how many elements were matched.
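Since the original goal was to count `img` elements, the same length check can be sketched for images. This is only an illustration: the markup string and the `$sources` array are made up for the example, and the document is loaded from a string so the snippet runs without a server.

```php
<?php
// Hypothetical markup standing in for the real page.
$html = '<html><body><div id="foo"><img src="a.png" alt="A"><img src="b.png"></div></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

// getElementsByTagName() returns a DOMNodeList; length is a property.
$images = $dom->getElementsByTagName('img');
echo 'img elements found: ', $images->length, "\n";

$sources = [];
foreach ($images as $img) {
    // img is an empty element, so read attributes rather than nodeValue.
    $sources[] = $img->getAttribute('src');
}
echo implode("\n", $sources), "\n";
```

Reading `getAttribute('src')` instead of `nodeValue` is the relevant design choice here, because an `img` element has no text content to return.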

Example/Demo:

<?php

$dom = new DOMDocument;
$html = 'http://localhost/foo/index.php';
$dom->loadHTML($html);
echo 'Document HTML loaded: ', var_dump($dom->saveHTML()), "\n";
$aTags = $dom->getElementsByTagName('a');
echo 'A Elements found: ', var_dump($aTags->length), "\n";
foreach ($aTags as $node) {
   echo $dom->saveHtml($node), "\n";
}

Output:

Document HTML loaded: string(171) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>http://localhost/foo/index.php</p></body></html>
"

A Elements found: int(0)
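The output above shows that the URL text itself was parsed as the document, because `loadHTML()` expects a string of markup. A minimal sketch of the difference follows, assuming a made-up markup string in place of what the URL would return; `DOMDocument::loadHTMLFile()` is the method that takes a path or URL instead. The base-path prefix is a naive illustration, not a full URL resolver.

```php
<?php
// Hypothetical markup standing in for what the URL would return.
$html = '<html><body><a href="/one">One</a> <a href="/two" title="x">Two</a></body></html>';

$dom = new DOMDocument;
// loadHTML() parses a string of markup. To fetch and parse a file path or
// URL in one step, use $dom->loadHTMLFile($url) instead.
$dom->loadHTML($html);

$links = $dom->getElementsByTagName('a');
foreach ($links as $a) {
    // Change an attribute value (naive prefixing, for illustration only).
    $a->setAttribute('href', 'http://localhost/foo' . $a->getAttribute('href'));
    // Remove an attribute; this is a no-op if it is not present.
    $a->removeAttribute('title');
}

// Serialize the modified document back to HTML.
echo $dom->saveHTML();
```

With real markup loaded, the `foreach` from the question finds the `a` elements, and the attribute changes appear in the `saveHTML()` output.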

Hope this is helpful.

score 1 · Answer 2 · edited Jun 16 '12 at 11:07

Use PHP Simple HTML DOM Parser

If the images are inside a div inside body, you could write:

$html->find('body', 0)->find('div[id=foo]', 0)->find('img', 0)->src;

This is just an example (it reads the src of the first matching image; without the index, find() returns an array), but you can do a lot more with this class.

Refer to its manual at

http://simplehtmldom.sourceforge.net/manual.htm
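For comparison, the comments on this answer point out that ext/DOM offers XPath in place of selectors. A rough equivalent of the lookup above might look like the sketch below; the markup string and the `div` id are assumptions for the example, not taken from the asker's page.

```php
<?php
// Hypothetical markup: two images inside div#foo, one outside it.
$html = '<html><body><div id="foo"><img src="a.png"><img src="b.png"></div><img src="c.png"></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

// DOMXPath::query() returns a DOMNodeList of the matching nodes.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//body//div[@id="foo"]//img');

// Guard against an empty result before reading the first match.
$first = $nodes->length > 0 ? $nodes->item(0)->getAttribute('src') : null;
echo $first, "\n";
```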

edited Jun 16 '12 at 11:07 by Gordon
answered Jun 16 '12 at 09:51 by Eswar Rajesh Pinapala

why replace perfectly good wheels (ext/DOM) with a wooden crutch (simplehtmldom)? – Gordon Jun 16 '12 at 10:18
Marco Leonardo Yamin wanted to manipulate elements, and I thought simplehtmldom would be easier to use. In addition, it helps you manipulate HTML elements. The class is not limited to valid HTML; it can also work with HTML code that did not pass W3C validation. Document objects can be found using selectors, similar to those in jQuery. You can find elements by ids, classes, tags, and much more. DOM elements can also be added, deleted or altered. This DOES NOT mean that the above can be done only by simplehtmldom and not by the DOM lib itself. – Eswar Rajesh Pinapala Jun 16 '12 at 10:51
ext/DOM isn't limited to valid HTML either. True, it doesn't support selectors, but XPath is more powerful anyway. Also, selectors are supported in phpQuery and Zend_DOM, which build on ext/dom. IMO those are much better third-party libraries in terms of speed and memory consumption. Just have a look at the SimpleHtmlDom source code. It's hackware. – Gordon Jun 16 '12 at 11:11
Gordon, I agree it's hackware. It's also prone to memory leaks. I was just giving the user my personal opinion. Also, thanks for pointing me to phpQuery and Zend_DOM. I will check them out. – Eswar Rajesh Pinapala Jun 16 '12 at 11:14
you can find more suggestions in http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662 – Gordon Jun 16 '12 at 11:19
