I'm having trouble learning how to use DOMDocument to scrape a webpage. I can get it to work using preg_match_all and a regex, but from what I've read that approach is error-prone and DOMDocument is the better tool. Basically I want to convert this:

function lodestone_scraper_preg() {
    // Get the HTML returned from the following URL.
    $html = file_get_contents('http://somepage.com/target');

    $pattern = '/<div\s*class="topics_list_inner">(.*?)<\/div>/s';
    preg_match_all($pattern, $html, $matches, PREG_PATTERN_ORDER);
    $matches = $matches[0];
    $five = array_slice($matches, 0, 5);

    print_r($five);
}

into something that uses the DOM Document model. So far I've come up with this:

function lodestone_scraper_dom() {
    $html = file_get_contents('http://somepage.com/target');
    $lodestone_doc = new DOMDocument();

    if (!empty($html)) {
        $lodestone_doc->loadHTML($html);
        $lodestone_xpath = new DOMXPath($lodestone_doc);
        $lodestone_row = $lodestone_xpath->query('//div[@class="topics_list_inner"]');

        if ($lodestone_row->length > 0) {
            foreach ($lodestone_row as $row) {
                echo $row->nodeValue;
            }
        }
    }
}

But it only outputs each node's text content, without any HTML. I need the HTML markup included as well. The key seems to be saveXML, but I get errors whenever I try to incorporate it. Any hints as to what I can do here?
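
One way to get the markup rather than just the text, as a minimal sketch: since PHP 5.3.6, DOMDocument::saveHTML() accepts an optional node argument and returns that node serialized as HTML, including its own tag. The inline $html string and the helper name outer_html_of_matches() are made up for illustration and stand in for the fetched page.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body>'
      . '<div class="topics_list_inner"><a href="/a">First</a></div>'
      . '<div class="topics_list_inner"><a href="/b">Second</a></div>'
      . '</body></html>';

// Hypothetical helper: returns the outer HTML of every node matched by $query.
function outer_html_of_matches($html, $query) {
    $doc = new DOMDocument();
    // Real pages are rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $out = array();
    foreach ($xpath->query($query) as $node) {
        // saveHTML($node) serializes just this node, tags included (PHP >= 5.3.6).
        $out[] = $doc->saveHTML($node);
    }
    return $out;
}

$rows = outer_html_of_matches($html, '//div[@class="topics_list_inner"]');
print_r(array_slice($rows, 0, 5));
```

Note that @class="topics_list_inner" only matches when the class attribute is exactly that string; a div carrying additional classes would need a contains() test instead.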

@Marc Tried the suggestion in the link. Unable to get any output from it.

Here's what I tried along with the DOMinnerHTML function:

function lodestone_scraper_innerHTML() {
    // Get the HTML returned from the following URL.
    $html = file_get_contents('http://na.finalfantasyxiv.com/lodestone/topics/');

    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->formatOutput       = true;
    $dom->load($html);

    $domTable = $dom->getElementsByTagName("div");

    foreach ($domTable as $tables) {
        echo DOMinnerHTML($tables);
    }
}
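
A possible reason for the empty output, offered as a sketch rather than a confirmed diagnosis: DOMDocument::load() expects the path or URL of an XML document, whereas loadHTML() parses an HTML string such as the one file_get_contents() returns. The DOMinnerHTML() function itself is not shown in this post, so the version below is an assumed equivalent built on saveHTML($node), and the sample $html is hypothetical.

```php
<?php
// Assumed stand-in for the DOMinnerHTML() helper referenced above:
// concatenates the serialized HTML of each child of $element.
function DOMinnerHTML(DOMNode $element) {
    $inner = '';
    foreach ($element->childNodes as $child) {
        $inner .= $element->ownerDocument->saveHTML($child);
    }
    return $inner;
}

// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body><div class="topics_list_inner">'
      . '<p>Hello <b>world</b></p>'
      . '</div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep warnings about sloppy markup quiet
$dom->loadHTML($html);              // loadHTML() takes a string; load() takes a path
libxml_clear_errors();

$inner = array();
foreach ($dom->getElementsByTagName('div') as $div) {
    $inner[] = DOMinnerHTML($div);
}
print_r($inner);
```

Because the helper serializes only the children, the wrapping div tag is left out; calling saveHTML() on the div itself would include it.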
Orophen
  • If you're looking for an `innerHTML` equivalent, then you want this: http://stackoverflow.com/questions/2087103/innerhtml-in-phps-domdocument – Marc B Feb 19 '14 at 15:13
  • Same result I found when searching. http://stackoverflow.com/questions/2087103/innerhtml-in-phps-domdocument – ElefantPhace Feb 19 '14 at 15:14
  • I tried this, even in the context of just getting it to spit out anything at all. I'm getting no output, so either I'm making a mistake or don't know how to use it properly. – Orophen Feb 19 '14 at 18:42
  • `DOMDocument` is meant for well-formed XML documents. I would not recommend it for "scraping", as a good number of web pages contain some form of error or invalid syntax. I recommend using an HTML5 parser. – Dean Taylor Feb 19 '14 at 23:10
  • @Dean Taylor - DOM can also be used with HTML; there seems to be some confusion about loadHTML, loadXML, saveHTML, and saveXML. Obviously two are for HTML and the other two are for XML. – pguardiario Feb 20 '14 at 00:24
  • @pguardiario There are HTML5 specific features including new parsing rules oriented towards flexible parsing and compatibility; not based on SGML. An HTML5 parser / spec ensures that what is "parsed" is the same regardless of browser and invalid HTML. `DOMDocument` does not provide this currently. – Dean Taylor Feb 20 '14 at 03:42
  • @Dean Taylor, DOMDocument actually does a pretty good job with invalid HTML. An HTML5 mode would be nice, but I've never missed it. – pguardiario Feb 20 '14 at 04:18

0 Answers