Below is my code:

$xpath = new DOMXPath($doc);
// Start from the root element
$query = '//div[contains(@class, "hudpagepad")]/div/ul/li/a';
$nodeList = @$xpath->query($query);

// The size is 104
$size = $nodeList->length;

for ( $i = 1; $i <= $size; $i++ ) {
    $node = $nodeList->item($i-1);
    $url = $node->getAttribute("href");

    $error = scrapeURL($url);
}

function scrapeURL($url) {
    $cfm = new DOMDocument();
    $cfm->loadHTMLFile($url);
    $cfmpath = new DOMXPath($cfm);
    $pointer = $cfm->getElementById('content-area');
    $filter = 'table/tr';

    // The problem lies here    
    $state = $pointer->firstChild->nextSibling->nextSibling->nodeValue;

    $nodeList = $cfmpath->query($filter, $pointer);
}

Basically, this iterates over a list of links and scrapes each one with the scrapeURL function.

I don't know what the problem is, but at random I get a non-object error when I use $pointer. Other times the script runs without any error and the values are correct.

Does anyone know what the problem is? My guess is that it occurs when the page has not loaded properly.

JohnnyQ
  • Don't suppress errors (`@` operator)... especially when things aren't working right. "I'll just ignore this broken leg and go run a marathon... hey! why can't I run properly?" Beyond that, DOM is **VERY** sensitive to malformed html and will refuse to parse documents that otherwise work "ok" in browsers. – Marc B Jun 11 '12 at 03:56
  • Thank you @MarcB. But that doesn't solve the problem that `$pointer` randomly throws a non-object error. I have created a log file to monitor the execution, and there are times when, for example, the 41st link makes `$pointer` throw a non-object error; when I execute it again, the 41st link does not throw any error and other links show the symptom instead, for example the 88th link. It occurs randomly and I can't seem to understand its behaviour. Does it matter that it's a .cfm file and not a .html file? I don't think so. – JohnnyQ Jun 11 '12 at 04:51
  • You can't chain objects and expect none of the in-betweens to be null, especially with a structure that you know is not fixed – Ja͢ck Jun 11 '12 at 09:30
  • possible duplicate of [PHP HTML DomDocument getElementById problems](http://stackoverflow.com/questions/3391942/php-html-domdocument-getelementbyid-problems) - at least your answer suggests that. – hakre Jun 12 '12 at 12:33

1 Answer

I found the idea for the answer here:

http://sharovatov.wordpress.com/2009/11/01/php-loadhtmlfile-and-a-html-file-without-doctype/

It is better to use a manual XPath query than getElementById, because getElementById breaks if the DOCTYPE of the document you are about to load is missing or malformed.

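So use this instead (query() returns a node list, so item(0) picks the first match, or null if there is none):

$pointer = $cfmpath->query("//*[@id='content-area']")->item(0);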

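Or create a helper function:

// Takes the document as a parameter instead of relying on a global $dom.
function getElementById(DOMDocument $dom, $id) {
    $xpath = new DOMXPath($dom);
    // item(0) returns null when no element has the given id
    return $xpath->query("//*[@id='$id']")->item(0);
}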
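
A minimal sketch of how the XPath lookup could be combined with a null check, so that a page where the element is missing gets skipped instead of raising the non-object error. The helper name `findById` and the inline sample markup are assumptions made for illustration and are not part of the original code. `libxml_use_internal_errors` stands in for the `@` operator, as the comments on the question suggest.

```php
<?php
// Hypothetical helper: look up an element by id via XPath.
// Returns null when nothing matches. Assumes $id contains no single quotes.
function findById(DOMDocument $doc, string $id): ?DOMElement
{
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("//*[@id='" . $id . "']");
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $node = $nodes->item(0);
    return $node instanceof DOMElement ? $node : null;
}

// Collect parser warnings in libxml's buffer instead of hiding them with @.
libxml_use_internal_errors(true);

// Sample markup without a DOCTYPE (an assumption standing in for a fetched page).
$html = '<html><body><div id="content-area">'
      . '<table><tr><td>Ohio</td></tr></table>'
      . '</div></body></html>';

$doc = new DOMDocument();
if (!$doc->loadHTML($html)) {
    exit("Could not parse the document\n");
}

$pointer = findById($doc, 'content-area');
if ($pointer === null) {
    // Skip this page rather than calling methods on a non-object.
    echo "content-area not found\n";
} else {
    // Same relative query as in the question, evaluated from $pointer.
    $rows = (new DOMXPath($doc))->query('table/tr', $pointer);
    echo $rows->length, " row(s) under content-area\n";
}

libxml_clear_errors();
```

The same guard applies to chains such as `firstChild->nextSibling->nextSibling`: each step can return null, so it is safer to check the result before reading `nodeValue`.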

Thank you to those who attempted to help!

JohnnyQ