I am crawling links from a website (this one), but the structure of the site produces unwanted extra output. The <a> tags contain the name of an article plus additional information (images and the sources of those images). I would like to get rid of that additional information. I found the :not selector for this, but I guess I am using it wrong, because every combination I have tried gives me no output at all.

Here is the output.

Here is the code I need to alter:

$posts = $html->find('ul[class=river] a[data-omni-click=inherit] :not[figure]');

(I have also tried figure:not and a couple of other combinations.)

Does anyone know where I went wrong, and what I have to do to exclude the <figure> tag?

Here is my full code, in case it helps:

<div class='rcorners1'>
 <?php
include_once('simple_html_dom.php');

$target_url = "http://www.theatlantic.com/most-popular/";

$html = new simple_html_dom();

$html->load_file($target_url);

$posts = $html->find('ul[class=river] a[data-omni-click=inherit] :not[figure]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
  $post = $posts[$i];
  $post->href = 'http://www.theatlantic.com'.$post->href;
  echo strip_tags($post, '<p><a>'); //echo ($post); 

}
?>
</div>
</div> 
Jasper
  • Please post the HTML output that this PHP creates. – Aaron Mar 01 '16 at 08:46
  • This should help you understand the `:not` pseudo-class: http://stackoverflow.com/a/35650693/1676224 – Aaron Mar 01 '16 at 08:51
  • Although the syntax is probably inspired by CSS selectors, the simple_html_dom library for PHP does NOT use CSS selectors to find HTML elements and `:not` selectors are not likely to work. http://simplehtmldom.sourceforge.net/manual.htm#section_find – reinder Mar 01 '16 at 11:22
  • @Aaron, [here](http://globalsocialnews.com/crawler/test8.php) is the HTML output. Thank you for the clarification on using the `:not` selector. @reinder, yes, I checked the library and it includes the option "[attribute!=value] Matches elements that don't have the specified attribute with a certain value.". I tried that as well but couldn't figure out how to make it work. – Jasper Mar 01 '16 at 12:30
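
Since the comments suggest `:not` is unlikely to work in simple_html_dom, one possible workaround is to select the links first and strip the <figure> elements afterwards. The sketch below does this with PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom, so it is a swapped-in technique rather than a fix to the original selector. The sample markup, the `extractPosts` function, and the `<span>` holding the title are all hypothetical; the real page structure is assumed, not copied from the site.

```php
<?php
// Hypothetical sample markup standing in for the real page (an assumption).
$sample = <<<HTML
<ul class="river">
  <li><a data-omni-click="inherit" href="/politics/archive/example-one/">
    <figure><img src="/img/one.jpg" alt=""><figcaption>Photo source one</figcaption></figure>
    <span>First article title</span>
  </a></li>
  <li><a data-omni-click="inherit" href="/science/archive/example-two/">
    <figure><img src="/img/two.jpg" alt=""><figcaption>Photo source two</figcaption></figure>
    <span>Second article title</span>
  </a></li>
</ul>
HTML;

/**
 * Hypothetical helper: return up to $limit title/url pairs for the matching
 * links, ignoring everything nested inside a <figure>.
 */
function extractPosts(string $markup, string $baseUrl, int $limit = 10): array
{
    $doc = new DOMDocument();
    // libxml reports HTML5 tags such as <figure> as errors; collect them quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = $xpath->query('//ul[@class="river"]//a[@data-omni-click="inherit"]');

    $posts = [];
    foreach ($links as $link) {
        if (count($posts) >= $limit) {
            break;
        }
        // Detach every <figure> below this link before reading its text.
        foreach (iterator_to_array($xpath->query('.//figure', $link)) as $figure) {
            $figure->parentNode->removeChild($figure);
        }
        $posts[] = [
            // Collapse the whitespace left behind by the removed nodes.
            'title' => trim(preg_replace('/\s+/', ' ', $link->textContent)),
            'url'   => $baseUrl . $link->getAttribute('href'),
        ];
    }
    return $posts;
}

$posts = extractPosts($sample, 'http://www.theatlantic.com');
foreach ($posts as $post) {
    echo '<p><a href="' . htmlspecialchars($post['url']) . '">'
        . htmlspecialchars($post['title']) . "</a></p>\n";
}
```

To run this against the live page, the markup could be fetched with `file_get_contents($target_url)` and passed in place of `$sample`, assuming the host allows URL fopen wrappers. With simple_html_dom itself, a comparable idea would be to find the figure elements inside each link and blank their `outertext` before echoing the link; that variant is untested here.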

0 Answers