I am using SimpleHTMLDOM to scrape pages (on servers other than mine).
The basic implementation is:

    $images = array();
    $html = false;
    try {
        $html = file_get_html(urldecode(trim($url)));
    } catch (Exception $e) {
        echo $url;
    }
    // Skip the loop if the page could not be loaded.
    if ($html) {
        foreach ($html->find('img') as $element) {
            $src = $element->src;
            // Keep only JPEG and PNG images.
            if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
                $images[] = $src;
            }
        }
    }

This works fine but it returns all images from the page, including small avatars, icons, and button images. Of course I'd like to avoid these.
I then tried adding a size check inside the loop, as follows:

    ...
    if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
        $size = getimagesize($src);
        if ($size[0] > 200) {
            $images[] = $src;
        }
    }
    ...

That works well on a page like http://cnn.com. But on other pages it produces numerous errors. For example,
http://www.huffingtonpost.com/2012/05/27/alan-simpson-republicans_n_1549604.html
gives a bunch of errors like:

    Severity: Warning
    Message: getimagesize(/images/snn-logo-comments.png): failed to open stream: No such file or directory
    Severity: Warning
    Message: getimagesize(/images/close-gray.png): failed to open stream: No such file or directory

which seem to be happening because some images use relative URLs. The problem is that this crashes the script, so no images are loaded and my Ajax box keeps loading forever.
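My guess is that each src needs to be turned into an absolute URL before it is passed to getimagesize(). Below is a rough sketch of what I have in mind. Note that resolve_image_url() is a helper name I made up (it is not part of SimpleHTMLDOM), and it assumes the page URL is a full http(s) URL. It does not normalize "../" segments or honor a <base> tag, so it is only an approximation of proper URL resolution.

```php
<?php
// Hypothetical helper (not part of SimpleHTMLDOM): turn an <img> src into an
// absolute URL, using the URL of the page the image was found on.
function resolve_image_url($src, $pageUrl)
{
    // Already absolute, e.g. http://example.com/a.png
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }

    $base = parse_url($pageUrl);
    if ($base === false || empty($base['scheme']) || empty($base['host'])) {
        return null; // cannot resolve without a usable page URL
    }

    // Protocol-relative, e.g. //cdn.example.com/a.png
    if (substr($src, 0, 2) === '//') {
        return $base['scheme'] . ':' . $src;
    }

    $root = $base['scheme'] . '://' . $base['host'];
    if (isset($base['port'])) {
        $root .= ':' . $base['port'];
    }

    // Root-relative, e.g. /images/a.png
    if (substr($src, 0, 1) === '/') {
        return $root . $src;
    }

    // Path-relative, e.g. img/a.png: resolve against the page's directory.
    $path = !empty($base['path']) ? $base['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $src;
}
```

Inside the loop I would then call getimagesize() on the resolved URL instead of the raw $src, and check its return value against false before reading $size[0], since getimagesize() returns false (and raises a warning) when it cannot open the image.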
Do you have any ideas how to troubleshoot this?