0

I'm working with a DOM parser and running into problems. I'm trying to grab the href of every <a> tag whose class is 'thumbnail '. I've been trying to print the links on the screen, but I get no output at all. I also turned on error_reporting(E_ALL); and still see nothing. Any help is appreciated.

$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$classId = "thumbnail ";
$div = $html->find('a#'.$classId);
echo $div;

I also tried this but still had the same result of NOTHING:

include('simple_html_dom.php');
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$ret = $html->find('a[class=thumbnail]');
echo $ret;
Mike 'Pomax' Kamermans
Zach Harvey
  • 2
    `$html` is a string, not an object, so you would never be able to do `$html->`. You are mixing DOMDocument and the Simple HTML DOM parser. – nickb Dec 17 '13 at 00:27
  • I thought when I was reloading it into the DOM it was an object not a string? Correct me if I'm wrong? – Zach Harvey Dec 17 '13 at 00:29
  • Could you help me on where I went wrong with the statement? I'm new when it comes to DOM and I'm trying to understand the full functions of it. – Zach Harvey Dec 17 '13 at 00:32
  • @ZachHarvey The first snippet isn't working because there are no hyperlinks with the id `thumbnail`. You're looking for the *class* `thumbnail` instead. – silkfire Dec 17 '13 at 00:33
  • 1
    `$hrefs` pretty much looks like it contains what you want; drop that non-existent `->find()` call, and [probably drop that whole slow simple html dom thing](http://stackoverflow.com/a/3577662/358679) – Wrikken Dec 17 '13 at 00:33
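
To make the point from the comments concrete, here is a minimal sketch of the string-versus-object distinction. The inline markup is an assumption standing in for the reddit page, so the snippet runs without network access; it is not the real page structure.

```php
<?php
// Hypothetical stand-in markup; the question fetches the real page
// with file_get_contents(), which also returns a plain string.
$html = '<html><body><a class="thumbnail " href="/r/funny/1">one</a></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // parses the string into $dom; $html itself stays a string

var_dump(is_string($html));            // $html is a string, so $html->find() cannot work
var_dump($dom instanceof DOMDocument); // the parsed document lives in $dom

// DOMDocument has no find() method; query it through DOMXPath instead.
$xpath = new DOMXPath($dom);
$count = $xpath->query('//a')->length;
echo $count, "\n"; // number of <a> elements in the stand-in markup
```

The `find()` method belongs to the Simple HTML DOM library's own objects, which is why calling it on `$html` here does nothing useful.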

4 Answers

3

You were almost there:

<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.reddit.com/r/funny');

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a[contains(concat(' ',normalize-space(@class),' '),' thumbnail ')]");
var_dump($hrefs);

Gives:

class DOMNodeList#28 (1) {
  public $length =>
  int(25)
}

25 matches, I'd call it success.
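
The `var_dump` above only shows the match count. A minimal sketch of pulling the actual href values out of the node list might look like this; the inline markup and its paths are made up for illustration so the example runs offline, and only the XPath expression is taken from the answer.

```php
<?php
// Hypothetical stand-in markup so the sketch runs without fetching reddit.
$html = '<html><body>'
      . '<a class="thumbnail " href="/r/funny/1">one</a>'
      . '<a class="title" href="/r/funny/2">two</a>'
      . '<a class="thumbnail loggedin" href="/r/funny/3">three</a>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$query = "/html/body//a[contains(concat(' ',normalize-space(@class),' '),' thumbnail ')]";

// A DOMNodeList is iterable; each item here is a DOMElement.
$found = array();
foreach ($xpath->query($query) as $a) {
    $found[] = $a->getAttribute('href');
}

echo implode("\n", $found), "\n";
```

Against the live page, the same loop can be run over `$hrefs` directly.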

Wrikken
  • 1
    Can skip one line by using `$dom->loadHTMLFile($url)` – Phil Dec 17 '13 at 00:38
  • 1
    @Phil: duly noted. I was going for "least amount of change in the original code", but we may as well get that in there indeed, I'll edit it. – Wrikken Dec 17 '13 at 00:46
1

This code would probably work:

$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$hyperlinks = $xpath->query('//a[@class="thumbnail"]');

foreach($hyperlinks as $hyperlink) {
   echo $hyperlink->getAttribute('href'), '<br>';
}
silkfire
  • 1
    Hm, I always use the `contains(concat(' ',@class,' '),' thumbnail ')` for checking whether something _has_ a class, but possibly also other classnames. – Wrikken Dec 17 '13 at 00:37
  • 1
    WINNER WINNER CHICKEN DINNER! Thank you so much silkfire! – Zach Harvey Dec 17 '13 at 00:37
  • Also, it's weird, Wrikken: I can't do the concat, for the simple reason that the site I'm trying to find the images from has the class written as class="thumbnail ". An extra space at the end screwed up everything for me for a few hours! – Zach Harvey Dec 17 '13 at 00:38
  • 1
    This answer does not work on the sample URL provided. Those elements have a class attribute value of `"thumbnail "` (or `"thumbnail loggedin"` if you're a reddit user) – Phil Dec 17 '13 at 00:43
  • 1
    @ZachHarvey: can't do the concat? If it is a space, that concat thing would still just work. If however it is 'other kind of whitespace' (tabs, newlines,...), this is a bit more robust: `contains(concat(' ',normalize-space(@class),' '),' thumbnail ')` – Wrikken Dec 17 '13 at 00:51
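
To illustrate the difference discussed in these comments, here is a small sketch comparing the exact-match test with the token test on a few made-up class values, including the trailing-space case. The markup and the `hrefsFor` helper are hypothetical and only there for the comparison.

```php
<?php
// Hypothetical helper: collect the href values matched by an XPath expression.
function hrefsFor(DOMXPath $xpath, $expr) {
    $out = array();
    foreach ($xpath->query($expr) as $node) {
        $out[] = $node->getAttribute('href');
    }
    return $out;
}

// Made-up class values covering the cases mentioned in the comments.
$html = '<html><body>'
      . '<a class="thumbnail" href="/a">a</a>'
      . '<a class="thumbnail " href="/b">b</a>'
      . '<a class="thumbnail loggedin" href="/c">c</a>'
      . '<a class="thumbnails" href="/d">d</a>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact string comparison against the whole attribute value.
$exact = hrefsFor($xpath, '//a[@class="thumbnail"]');

// Token test: pad the normalized class list with spaces, then look for ' thumbnail '.
$token = hrefsFor($xpath, "//a[contains(concat(' ',normalize-space(@class),' '),' thumbnail ')]");

print_r($exact);
print_r($token);
```

Since the parser keeps the attribute value as written, the exact comparison should match only `/a`, while the token test should match `/a`, `/b` and `/c` and skip `thumbnails`.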
0

If you're using simple_html_dom, why are you doing all these superfluous things? It already wraps the resource in everything you need -- http://simplehtmldom.sourceforge.net/manual.htm

include('simple_html_dom.php');

// set up:
$html = new simple_html_dom();

// load from URL:
$html->load_file('http://www.reddit.com/r/funny');

// find those <a> elements:
$links = $html->find('a[class=thumbnail]');

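// find() returns an array of elements, so print each href in turn:
foreach ($links as $link) {
    echo $link->href, "\n";
}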
Mike 'Pomax' Kamermans
0

Tested it and made some changes; this works perfectly too.

<?php
    // load the url and set up an array for the links
    $dom = new DOMDocument();
    @$dom->loadHTMLFile('http://www.reddit.com/r/funny');
    $links = array();

    // loop thru all the A elements found
    foreach($dom->getElementsByTagName('a') as $link) {
        $url = $link->getAttribute('href');
        $class = $link->getAttribute('class');

        // Check if the URL is not empty and if the class contains thumbnail
        if(!empty($url) && strpos($class,'thumbnail') !== false) {
            array_push($links, $url);
        }
    }

    // Print results
    print_r($links);
?>
ArendE
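
One thing to keep in mind with the `strpos` check above: it is a substring test, so it would also accept a class such as `thumbnails`. If that matters, a token-based check is one possible alternative. A minimal sketch, where `hasClass` is a hypothetical helper name and not part of any library:

```php
<?php
// Hypothetical helper: true when $name is one of the whitespace-separated
// tokens in a class attribute value.
function hasClass($classAttr, $name) {
    $tokens = preg_split('/\s+/', trim($classAttr), -1, PREG_SPLIT_NO_EMPTY);
    return in_array($name, $tokens, true);
}

var_dump(hasClass('thumbnail ', 'thumbnail'));         // trailing space is ignored
var_dump(hasClass('thumbnail loggedin', 'thumbnail')); // one token among several
var_dump(hasClass('thumbnails', 'thumbnail'));         // different token, not a match
var_dump(strpos('thumbnails', 'thumbnail') !== false); // the substring test accepts it
```

In the loop above, `hasClass($class, 'thumbnail')` could stand in for the `strpos` condition.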