cURL Page and Navigate DOM

Question

I am simply trying to retrieve a page's title with the script below. However, I am doing something wrong, because I keep getting this error:

PHP Fatal error:  Call to a member function getElementsByTagName() on a non-object in /Users/robertquinn/Desktop/SCRAPE/asu.php on line 22

This is my first time using the cURL functions, so please let me know if I am horribly screwing something up here. Is getElementsByTagName() solely an XML DOM method?

<?php  

    function get_data($url) {

        $userAgent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5';
        $ch = curl_init();
        curl_setopt($ch,CURLOPT_COOKIE,"someCookie=2127;onlineSelection=C");
        curl_setopt ($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
        curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 100);
        $html = curl_exec($ch);
        curl_close($ch);

        $doc = new DOMDocument();
        $body = $doc->loadHTML( $html );
        $title_value = $body->getElementsByTagName('title')->nodeValue;

        echo $title_value;
    }

    get_data('http://www.someurl.com');

    ?>

Try printing the response with `echo $html;` just to see whether you get any result. – Mihai Iorga, Aug 18 '12 at 00:27
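Following the suggestion to inspect `$html`: because the script sets `CURLOPT_FAILONERROR`, `curl_exec()` returns `false` for HTTP status codes of 400 and above, as well as for transport errors. The sketch below checks for that before any parsing happens. The helper name `fetch_html` and the 30-second timeout are assumptions made for illustration; they are not part of the original script, and the cookie and user-agent options are left out for brevity.

```php
<?php
// Hypothetical helper (not from the original post): fetch a page and fail loudly.
function fetch_html(string $url): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);    // HTTP status >= 400 makes curl_exec() return false
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // assumed value, chosen for this sketch

    $html  = curl_exec($ch);
    $error = curl_error($ch);                       // empty string when no error occurred

    if ($html === false) {
        throw new RuntimeException("cURL request failed: $error");
    }
    return $html; // the handle is released when $ch goes out of scope
}
```

Calling `fetch_html()` inside a `try`/`catch` block makes a failed request visible immediately, instead of surfacing later as a confusing DOM error.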

Lawrence Cherone · Accepted Answer · 2012-08-18T00:38:33.947

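Change your DOMDocument part to the following. loadHTML() returns a boolean rather than a document object, so call getElementsByTagName() on $doc itself. That method returns a DOMNodeList, so pick the first match with item(0) before reading nodeValue: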

 $doc = new DOMDocument();
 $doc->loadHTML( $html );
 //Suppress strict errors or you could just suppress errors directly e.g: @$doc->loadHTML( $html );
 $doc->strictErrorChecking = false;

 $title_value = $doc->getElementsByTagName('title')->item(0)->nodeValue;

edited Aug 18 '12 at 00:38

answered Aug 18 '12 at 00:30

This works and returns the title value. However, I also get a bunch of errors referencing this line `$doc->loadHTML( $html );` --> http://i.imgur.com/zcUG3.png Any ideas? – flyingarmadillo Aug 18 '12 at 00:35
Yeah, it happens. Check the update or [look at this question](http://stackoverflow.com/questions/7082401/avoid-domdocument-xml-warnings-in-php). Hope it helps. – Lawrence Cherone Aug 18 '12 at 00:38
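The warnings discussed in these comments come from libxml complaining about markup it considers invalid. Besides the `@` operator, a commonly used option is `libxml_use_internal_errors(true)`, which collects parse errors internally instead of emitting PHP warnings. The following is a minimal sketch under that assumption. The helper name `extract_title` and the sample markup are invented for illustration and do not come from the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): return the <title> text, or null.
function extract_title(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse; loadHTML() rejects an empty string
    }

    $previous = libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();                        // discard the collected errors
    libxml_use_internal_errors($previous);        // restore the earlier setting

    if ($loaded === false) {
        return null;
    }

    // getElementsByTagName() returns a DOMNodeList, so check its length first.
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length === 0) {
        return null;
    }
    return trim($titles->item(0)->textContent);
}

// Sample markup with an HTML5 element and a stray closing tag, invented for this sketch.
$sample = '<html><head><title>Example Page</title></head><body><nav>menu</nav></p></body></html>';
echo extract_title($sample), PHP_EOL;
```

Returning `null` when no title is found keeps the caller from hitting a "member function on a non-object" style error when `item(0)` has nothing to return.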

score 0 · Answer 2 · answered Aug 18 '12 at 06:40

I like using simple_html_dom because it is really simple. For example, I would just do:

...
$page = curl_exec($ch);

$html = str_get_html($page);
echo $html->find('title', 0)->plaintext;

r-sal

I got it to work anyway, but I don't think this would work for me, because the site I am trying to access does not allow bots and requires cookie storage to authenticate its users. Thus, I need to manage a session, modify my user agent, and set my cookies with my script. – flyingarmadillo Aug 18 '12 at 06:58
Yeah, it isn't always the best option, but you can also use it by just passing the HTML to it as a string. For example, you could use cURL to fetch pages and maintain a session, then pass the fetched pages to simple_html_dom. – r-sal Aug 18 '12 at 20:24
