I'm having trouble learning how to use DOMDocument to scrape a webpage. I can get it working with preg_match_all and a regex, but from what I've read, parsing HTML with regex is error-prone and DOMDocument is the better tool. Basically, I want to convert this:
function lodestone_scraper_preg() {
    // Fetch the HTML returned by this URL
    $html = file_get_contents('http://somepage.com/target');
    $pattern = '/<div\s*class="topics_list_inner">(.*?)<\/div>/s';
    preg_match_all($pattern, $html, $matches, PREG_PATTERN_ORDER);
    $matches = $matches[0];
    $five = array_slice($matches, 0, 5);
    print_r($five);
}
into something that uses DOMDocument. So far I've come up with this:
function lodestone_scraper_dom() {
    $html = file_get_contents('http://somepage.com/target');
    $lodestone_doc = new DOMDocument();
    if (!empty($html)) {
        $lodestone_doc->loadHTML($html);
        $lodestone_xpath = new DOMXPath($lodestone_doc);
        $lodestone_row = $lodestone_xpath->query('//div[@class="topics_list_inner"]');
        if ($lodestone_row->length > 0) {
            foreach ($lodestone_row as $row) {
                echo $row->nodeValue;
            }
        }
    }
}
But it only outputs each node's text content, with the HTML tags stripped. I need the HTML markup included as well. The key seems to be saveXML, but I get errors whenever I try to incorporate it. Any hints as to what I can do here?
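For reference, one direction that may be worth checking: as far as I can tell, DOMDocument::saveHTML() accepts an optional node argument (PHP 5.3.6 and later) and returns that node's markup, tags included. Below is a minimal sketch of that idea, not a confirmed fix. The sample markup and the helper name outer_html_of_matches are made up for illustration, and the libxml error suppression is an assumption about how real-world pages would need to be handled.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body>'
      . '<div class="topics_list_inner"><a href="/a">First</a></div>'
      . '<div class="topics_list_inner"><a href="/b">Second</a></div>'
      . '</body></html>';

// Hypothetical helper: returns the outer HTML of up to $limit nodes
// matched by the XPath expression $query.
function outer_html_of_matches(string $html, string $query, int $limit = 5): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them;
    // real pages are rarely perfectly valid HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $results = [];
    foreach ($xpath->query($query) as $node) {
        if (count($results) >= $limit) {
            break;
        }
        // Passing the node serializes only that node, markup included.
        $results[] = $doc->saveHTML($node);
    }
    return $results;
}

$five = outer_html_of_matches($html, '//div[@class="topics_list_inner"]');
print_r($five);
```

This mirrors the array_slice($matches, 0, 5) step of the regex version by stopping after five nodes.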
@Marc I tried the suggestion in the link, but I couldn't get any output from it.
Here's what I tried, along with the DOMinnerHTML function:
function lodestone_scraper_innerHTML() {
    // Fetch the HTML returned by this URL
    $html = file_get_contents('http://na.finalfantasyxiv.com/lodestone/topics/');
    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->formatOutput = true;
    $dom->load($html);
    $domTable = $dom->getElementsByTagName("div");
    foreach ($domTable as $tables) {
        echo DOMinnerHTML($tables);
    }
}
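A possible cause of the empty output: DOMDocument::load() expects a file path or URL pointing to an XML document, whereas the attempt above passes it the already-fetched markup string. DOMDocument::loadHTML() is the method that parses an HTML string. The sketch below illustrates the difference under that assumption. Since the DOMinnerHTML function itself is not shown here, inner_html is a hypothetical stand-in that concatenates the serialized child nodes, and the sample markup is made up.

```php
<?php
// Hypothetical stand-in for DOMinnerHTML: serializes each child node
// of $element and joins the results, giving the element's inner markup.
function inner_html(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Made-up markup standing in for the fetched page.
$markup = '<div class="topics_list_inner"><a href="/a">First</a> topic</div>';

$dom = new DOMDocument();
// loadHTML() parses a string of markup; load() would treat it as a filename.
$dom->loadHTML($markup);

$inner = [];
foreach ($dom->getElementsByTagName('div') as $div) {
    $inner[] = inner_html($div);
}
print_r($inner);
```

For a live page, the string returned by file_get_contents() would take the place of $markup.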