Parsing multiple links through a PHP DOM to find classes

Question

I'm trying to use a DOM parser on multiple links and then compare for 2 pairs of values. Could someone help me see where I went wrong? Is it not possible to do the comparison for @class="badge-item-img"? EDIT: I should mention that the first foreach works, but the second one shows no results.

<?php
// Init the '$url_array' array.
$url_array = array();
$url_array[] = 'http://www.reddit.com/r/funny';
$url_array[] = 'http://www.9gag.com/';

// Init the return '$ret' array.
$ret = array();

// Roll through the '$url_array' array.
foreach ($url_array as $url_value) {
  $html = file_get_contents($url_value);
  $dom = new DOMDocument();
  $dom2 = new DOMDocument();
  @$dom->loadHTML($html);

  $xpath = new DOMXPath($dom);
  $xpath2 = new DOMXPath($dom2);
  $hyperlinks = $xpath->evaluate('//a[@class="thumbnail "]');
  $hyperlinks2 = $xpath2->evaluate('//a[@class="badge-item-img"]');

  foreach($hyperlinks as $hyperlink) {
    if(strpos($hyperlink->getAttribute('href'), 'http://i.imgur.com/') !== FALSE){
      $ret[] = "<img style='padding-left:30%' width=\"500\" src=\"" . $hyperlink->getAttribute('href') . "\" alt=\"\" />"
             . "<br>"
             . "<br>"
             . "<br>"
             ;

    }
    foreach($hyperlinks2 as $hyperlinker) {
            $ret[] = "<img style='padding-left:30%' width=\"500\" src=\"" . $hyperlinker->getAttribute('href') . "\" alt=\"\" />"
             . "<br>"
             . "<br>"
             . "<br>"
             ;
    }
  } 
  }
// Roll through the '$ret' array.
foreach($ret as $ret_value) {
  echo $ret_value;
}

If you are not looking to set up a hacking server, I would suggest using front-end JS to get the result you want, for example with GreaseMonkey. Server-side HTML parsing is not very reliable. Also, if the target site uses JS to populate content dynamically, your PHP cannot run that JS anyway. — Zac, Dec 18 '13 at 02:38
I like your name! I'm not looking to run any illegal sites. The sites I'm using I know do not use JS for population of content. — Zach Harvey, Dec 18 '13 at 02:42
What do you mean by comparison for the @class="badge-item-img"? Are you trying to find duplicate images? — Zac, Dec 18 '13 at 02:53
What's up with the `@`? Please read [this answer](http://stackoverflow.com/questions/1148928/disable-warnings-when-loading-non-well-formed-html-by-domdocument-php/17559716#17559716) for a better approach. — Ja͢ck, Dec 18 '13 at 03:00
I don't know who downvoted this question, and I don't know why. I did A LOT of research before posting and I believed it was a question stackoverflow could help me with... I was right. I don't care if you don't like the question... Search this site: there is nothing on here that would answer my question. — Zach Harvey, Dec 18 '13 at 03:19

score 0 · Answer 1 · answered Dec 18 '13 at 02:52

The code you sent appears to be missing the following line:

@$dom2->loadHTML($html);

... I am not sure about XPath searching, but it might also have trouble if a single element carries multiple classes in the HTML, which is valid XHTML.
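To illustrate the multiple-class concern, here is a minimal sketch. The markup is made up for the example (it is not taken from 9gag), and the `contains(concat(...))` expression is a common XPath 1.0 idiom for matching one class token rather than something from the original code.

```php
<?php
// Hypothetical markup: the second link carries two classes.
$html = '<div>'
      . '<a class="badge-item-img" href="/a.jpg">one</a>'
      . '<a class="badge-item-img featured" href="/b.jpg">two</a>'
      . '<a class="other" href="/c.jpg">three</a>'
      . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Exact comparison: matches only when the whole attribute equals the string.
$exact = $xpath->query('//a[@class="badge-item-img"]');

// Token comparison: matches when the class list contains the token.
$token = $xpath->query(
    '//a[contains(concat(" ", normalize-space(@class), " "), " badge-item-img ")]'
);

echo $exact->length, "\n"; // 1
echo $token->length, "\n"; // 2
```

The exact comparison misses the element with two classes, so if the real page adds extra classes the token form is the safer choice.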

I would also suggest just storing the URLs in your first loop and adding the presentation information in your presentation loop.

foreach($ret as $ret_value) {
  echo '<img style="padding-left:30%" width="500" src="' . $ret_value . '"  alt="" /><br /><br /><br />';
}

It still doesn't seem to load the results for number 2. I can't use the single foreach, as I need to filter out results first. — Zach Harvey, Dec 18 '13 at 02:57

score 0 · Accepted Answer · answered Dec 18 '13 at 03:08

I fixed the bug; now you can pull images from 9gag.

<?php
// Init the '$url_array' array.
$url_array = array();
$url_array['http://www.reddit.com/r/funny'] = array( 'href', '//a[@class="thumbnail "]', 'http://i.imgur.com/');
$url_array['http://www.9gag.com/'] = array( 'src', '//img[@class="badge-item-img"]' );

// Init the return '$ret' array.
$ret = array();

// Roll through the '$url_array' array.
foreach ($url_array as $url_value => $ary_rules) {
  $html = file_get_contents($url_value);
  $dom = new DOMDocument();
  libxml_use_internal_errors(true);
  $dom->loadHTML($html);
  libxml_clear_errors();

  $xpath = new DOMXPath($dom);
  $hyperlinks = $xpath->evaluate($ary_rules[1]);

  foreach($hyperlinks as $hyperlink) {
    // empty() avoids an "undefined offset" notice when a rule has no filter string.
    if( empty($ary_rules[2]) || strpos($hyperlink->getAttribute($ary_rules[0]), $ary_rules[2] ) !== FALSE){
      $ret[$url_value][] = $hyperlink->getAttribute($ary_rules[0]);
    }
  }
}
// Roll through the '$ret' array.
foreach($ret as $ret_value_list) {
    foreach($ret_value_list as $ret_value){ 
        echo "<img style='padding-left:30%' width=\"500\" src=\"" . $ret_value . "\" alt=\"\" />"
             . "<br>"
             . "<br>"
             . "<br>"
             ;
    }
}

Here is how `$url_array` is defined: `$url_array['the_page_path_you_want_to_pull'] = array( 'the_attribute_contains_image_url', 'the_html_dom_xpath', 'the_extra_rule_filter_out_the_address_you_do_not_want/please_remove_here-if_you_do_not_wish_to_filter');` — Zac, Dec 18 '13 at 03:11
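As a rough sketch of the same rule idea with named keys instead of positions: the function name `extract_urls` and the keys `xpath`, `attr` and `must_contain` are hypothetical, and an inline HTML string stands in for `file_get_contents($url)` so the snippet runs without network access. The sample rule uses `contains(@class, ...)` rather than the exact `"thumbnail "` comparison, which is an assumption made to keep the example independent of trailing whitespace.

```php
<?php
/**
 * Collect attribute values from $html according to one rule.
 * Rule keys ('xpath', 'attr', 'must_contain') are hypothetical names.
 */
function extract_urls($html, array $rule)
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting

    $xpath = new DOMXPath($dom);
    $urls = array();
    foreach ($xpath->query($rule['xpath']) as $node) {
        $value = $node->getAttribute($rule['attr']);
        // An absent or empty filter means "keep everything".
        if (empty($rule['must_contain'])
            || strpos($value, $rule['must_contain']) !== false) {
            $urls[] = $value;
        }
    }
    return $urls;
}

// Made-up sample markup, one imgur link and one other link.
$sample = '<div>'
        . '<a class="thumbnail" href="http://i.imgur.com/abc.jpg">x</a>'
        . '<a class="thumbnail" href="http://example.com/page">y</a>'
        . '</div>';

$rule = array(
    'xpath'        => '//a[contains(@class, "thumbnail")]',
    'attr'         => 'href',
    'must_contain' => 'http://i.imgur.com/',
);

print_r(extract_urls($sample, $rule));
```

Leaving out `must_contain` returns every matched attribute value, which corresponds to the 9gag rule that has no filter.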
