DOM and XPath query for HTML parsing

Question

I'm trying to write a robot that will fetch and parse HTML daily. For parsing HTML I could use plain string functions like explode, or regular expressions, but I found the DOM XPath code much cleaner, so now I can make a configuration of all the sites I have to spider and the tags I have to extract, like:

'http://examplesite.com' => '//div/a[@class="articleDesc"]/@href'

So the code looks like this

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed HTML
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//body/div[@class="articleDesc"]');

foreach ($tags as $tag) {
    echo $tag->nodeValue . "\n";
}

So with this I get all the div tags with class articleDesc, which is great. But I noticed that all the HTML tags inside the div are stripped out. How would I get the whole contents of the div I'm looking at?

I also find it hard to find proper documentation for $xpath->query() that explains how to form the query string. The PHP site doesn't say much about its exact syntax. Still, my main problem i

See http://stackoverflow.com/questions/2087103/innerhtml-in-phps-domdocument — Michael Berkowski, Nov 20 '11 at 22:28
and its counterpart http://stackoverflow.com/questions/5404941/php-domdocument-outerhtml-for-element/5404962#5404962 — Gordon, Nov 20 '11 at 22:30
Nope, doesn't work for me. The function DOMinnerHTML($element) that's in the link doesn't work for my xpath object — Tadej Magajna, Nov 20 '11 at 22:37
Good XPath tutorial: http://schlitt.info/opensource/blog/0704_xpath.html — Matthew Turland, Nov 26 '11 at 03:54

score 2 · Accepted Answer · answered Nov 26 '11 at 04:08


The simple answer is:

foreach ($tags as $tag) 
    echo $dom->saveXML($tag);

If you want the a tags with their HTML intact, the XPath would be

//a[@class="articleDesc"]

That assumes the a tags have that class attribute.
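
As a minimal sketch of how this could produce a list of unstripped HTML strings for any XPath from a site configuration: the helper name outerHtmlOfMatches() and the sample markup are hypothetical, and it assumes PHP 5.3.6 or later, where DOMDocument::saveHTML() accepts a node argument. It uses saveHTML() rather than saveXML() so the output is serialized as HTML.

```php
<?php
// Hypothetical helper: returns one string per node matched by $expr.
function outerHtmlOfMatches(string $html, string $expr): array
{
    $dom = new DOMDocument();
    // Collect parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $result = [];
    foreach ($xpath->query($expr) as $node) {
        // Attribute nodes (e.g. .../@href) carry no markup, so take their value.
        $result[] = $node instanceof DOMAttr
            ? $node->value
            : $dom->saveHTML($node);
    }
    return $result;
}

// Sample input, made up for illustration.
$html = '<html><body><div class="articleDesc">Read <b>more</b> '
      . '<a href="/a1">here</a></div></body></html>';

$matches = outerHtmlOfMatches($html, '//div[@class="articleDesc"]');
$links   = outerHtmlOfMatches($html, '//div[@class="articleDesc"]/a/@href');

echo $matches[0], "\n"; // the div including its inner tags
echo $links[0], "\n";   // the href value
```

The same function could be called once per entry of a 'site' => 'xpath' configuration array after fetching each page.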

answered Nov 26 '11 at 04:08

pguardiario


score 1 · Answer 2 · answered Nov 21 '11 at 09:30


Try using http://www.php.net/manual/en/simplexmlelement.asxml.php

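Or, alternatively:

// Copies the children of $oNode into a fresh document and serializes it
function getNodeInnerHTML(DOMNode $oNode) {
  $oDom = new DOMDocument();
  foreach ($oNode->childNodes as $oChild) { // childNodes, not childNode
    $oDom->appendChild($oDom->importNode($oChild, true));
  }
  return $oDom->saveHTML();
}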
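
A variant that avoids the temporary document could look like the sketch below. It assumes PHP 5.3.6 or later (saveHTML() with a node argument); the function name innerHtml() and the sample markup are made up for illustration. It also assumes $node belongs to a document, so it would not work when passed the DOMDocument itself.

```php
<?php
// Hypothetical helper: concatenates the serialized children of $node,
// which yields the node's inner HTML without its own opening/closing tag.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes just that subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<div class="articleDesc"><p>First <em>item</em></p></div>');
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//div[@class="articleDesc"]') as $tag) {
    echo innerHtml($tag), "\n";
}
```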

answered Nov 21 '11 at 09:30

Sjaak Trekhaak


meh.. that would work in a way, but the perfect way for me would be to get from 'http://examplesite.com' => '//div/a[@class="articleDesc"]/@href' a list of html unstripped strings for the elements matching... I wonder how I'd do that – Tadej Magajna Nov 21 '11 at 11:14
I might get you wrong here, but doesn't that just require you to get the innerHTML, using one of the functions above, of the parent element matching your XPath? – Sjaak Trekhaak Nov 21 '11 at 13:03
I think not.... inner html of the parent element matching xpath would return all the html inside it. However, I'd like to get all the div tags that have class article desc for instance... – Tadej Magajna Nov 22 '11 at 16:56
So `echo getNodeInnerHTML($tag)` is not what you were looking for? If so, I'm having trouble understanding exactly what you want. Is it possible to show an example of your input, and the desired output? – Sjaak Trekhaak Nov 23 '11 at 11:26

score 0 · Answer 3 · answered Nov 25 '11 at 17:40


This should load all of the inner tags as well. While it's not DOM, the two are interchangeable, and later you can use dom_import_simplexml to bring it back into DOM.

$xml = simplexml_load_string($html);
$tags = $xml->xpath('//body/div[@class="articleDesc"]');
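
Note that simplexml_load_string() expects well-formed XML, so it can return false on real-world HTML. One possible workaround, sketched here with made-up sample markup, is to let DOMDocument's more lenient HTML parser read the page first and then hand the result to SimpleXML via simplexml_import_dom():

```php
<?php
// Sample input with an unclosed <meta>, which is valid HTML but not XML.
$html = '<html><head><meta charset="utf-8"></head><body>'
      . '<div class="articleDesc">Some <b>bold</b> text</div></body></html>';

$dom = new DOMDocument();
// Keep libxml warnings out of the output while parsing lenient HTML.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Convert the parsed document into a SimpleXMLElement (the root element).
$xml  = simplexml_import_dom($dom);
$tags = $xml->xpath('//body/div[@class="articleDesc"]');

foreach ($tags as $tag) {
    // asXML() serializes the element together with its inner tags.
    echo $tag->asXML(), "\n";
}
```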

answered Nov 25 '11 at 17:40

mseancole


gives an error. xpath doesn't work with $xml. If I try $xml = dom_import_simplexml($xml) prior to the second line it doesn't work either – Tadej Magajna Nov 25 '11 at 20:06
Exact error would be helpful. The first line imports the `$html` string into simplexml, if it's not a string try `simplexml_load_file` instead. The second line is copied directly from yours but converted for simplexml. Admittedly I have not run it myself, but this is the same code I use at work, and it works for me there. `dom_import_simplexml($tags)` should only be used after the simplexml has been loaded and assuming you have something you want to do with it in DOM, otherwise it is not necessary, just included in case you wanted to switch back to DOM after loading the results. – mseancole Nov 25 '11 at 22:45
simplexml_load_string($html) returns false and after I put that into xpath() it breaks of course... it also gives a lot of warnings like: Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 36: parser error : Opening and ending tag mismatch: META line 8 and HEAD in /usr/share/nginx/html/synd/robots/robot.php on line 25 I know the html may not be perfect, which may be the cause of simplexml returning false, but it is a proper html webpage which gets rendered in the browser – Tadej Magajna Nov 26 '11 at 00:16
From the sounds of it your html isn't well formed. Which, while not necessary for it to show up in the browser properly, it is if you wish to use any kind of parser on it. Try closing your meta and head tags and try again. Meta tags are self-closing so just add a forward slash to the end of them, that's easy enough to forget. Once your html is well formed it should work. – mseancole Nov 26 '11 at 02:00

score 0 · Answer 4 · answered Nov 26 '11 at 16:58


You could use this awesome spider framework (in Python): Scrapy

answered Nov 26 '11 at 16:58

Lao

