I am crawling links from a website (http://www.theatlantic.com/most-popular/), but the structure of the site produces unwanted extra output. Each <a> tag contains the name of an article plus additional information (images and the sources of those images). I would like to get rid of that additional information. I found the :not selector for this, but I guess I am using it wrong, because every combination I have tried gives me no output at all.
Here is the code I need to alter:
$posts = $html->find('ul[class=river] a[data-omni-click=inherit] :not[figure]');
(I have also tried figure:not and a couple of other combinations.)
Does anyone know where I went wrong, and what I have to do to exclude the <figure> tag?
Here is my full code, in case that helps:
<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');
$target_url = "http://www.theatlantic.com/most-popular/";
$html = new simple_html_dom();
$html->load_file($target_url);
$posts = $html->find('ul[class=river] a[data-omni-click=inherit] :not[figure]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for ($i = 0; $i < $limit; $i++) {
    $post = $posts[$i];
    $post->href = 'http://www.theatlantic.com'.$post->href;
    echo strip_tags($post, '<p><a>'); //echo ($post);
}
?>
</div>
</div>
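
For reference, here is a minimal sketch of the result I am after. It is not a fix for the selector above: it swaps simple_html_dom for PHP's built-in DOMDocument and DOMXPath, and instead of excluding <figure> in the selector it removes the <figure> nodes from each matched link before reading its text. The sample markup and the extractLinks() helper are made up for illustration; the real page structure may differ.

```php
<?php
// Hypothetical sample markup standing in for the real page (structure assumed).
$sample = <<<HTML
<ul class="river">
  <li><a data-omni-click="inherit" href="/politics/first/">
    <figure><img src="a.jpg" alt=""><figcaption>Photo: Agency</figcaption></figure>
    First article
  </a></li>
  <li><a data-omni-click="inherit" href="/science/second/">Second article</a></li>
</ul>
HTML;

// Hypothetical helper: returns href/title pairs with any <figure> content dropped.
function extractLinks(string $markup, string $baseUrl): array
{
    $doc = new DOMDocument();
    // HTML5 tags such as <figure> make libxml report errors; keep them internal.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//ul[@class="river"]//a[@data-omni-click="inherit"]') as $a) {
        // Snapshot the matching <figure> nodes into an array, then detach each one.
        foreach (iterator_to_array($xpath->query('.//figure', $a)) as $figure) {
            $figure->parentNode->removeChild($figure);
        }
        $links[] = [
            'href'  => $baseUrl . $a->getAttribute('href'),
            'title' => trim($a->textContent),
        ];
    }
    return $links;
}

$links = extractLinks($sample, 'http://www.theatlantic.com');
foreach ($links as $link) {
    echo '<p><a href="' . htmlspecialchars($link['href']) . '">'
        . htmlspecialchars($link['title']) . "</a></p>\n";
}
```

With simple_html_dom itself, the same idea would presumably be to find the figure elements inside each link and blank them out before echoing, but I have not verified that against my version of the library.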