How to scrape only the largest images from the DOM?

Question

I am using SimpleHTMLDOM to scrape pages (in servers other than mine).

The basic implementation is

try {
    $html = file_get_html(urldecode(trim($url)));
} catch (Exception $e) {
    echo $url;
}

foreach ($html->find('img') as $element) {
    $src = $element->src;
    if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
        $images[] = $src;
    }
}

This works fine but it returns all images from the page, including small avatars, icons, and button images. Of course I'd like to avoid these.

I then tried adding a size check inside the loop, as follows

...

if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
    $size = getimagesize($src);
    if ($size[0] > 200) {
        $images[] = $src;
    }
}
...

That works well on a page like http://cnn.com. But in others it returns numerous errors. For example

http://www.huffingtonpost.com/2012/05/27/alan-simpson-republicans_n_1549604.html

gives a bunch of errors like

<p>Severity: Warning</p>
<p>Message:  getimagesize(/images/snn-logo-comments.png): failed to open stream: No such file or directory
<p>Severity: Warning</p>
<p>Message:  getimagesize(/images/close-gray.png): failed to open stream: No such file or directory

which seem to be happening because of relative URLs in some images. The problem here is that this crashes the script and then no images are loaded, with my Ajax box loading forever.

Do you have any ideas how to troubleshoot this?

Scraping causes a lot of alarm bells to go off around here. It can help to put in some context about your legitimate reasons for doing this, to let people know they're not aiding and abetting content thievery. — grossvogel, May 28 '12 at 22:57
Are you serious? Have you ever heard of Pinterest or Facebook? — pepe, May 28 '12 at 23:08
I'm also a little "Oh, this" about this question, but I don't think we should judge without any facts. Scraping is kind of sleazy looking, but then again who knows what it's for. Also, is it that difficult for the admins of these sites to see him scraping and block the IP? Of course, ways around that...but if you're scraping and getting cut off and then finding a workaround, well, then you're an a$$. — phatskat, May 28 '12 at 23:24
Oh my @dagon, I'm not having this discussion. Points made by grossvogel are valid. — pepe, May 28 '12 at 23:26

score 1 · Answer 1 · edited May 23 '17 at 11:54

The problem is that the image URLs are relative to the site root, so getimagesize() treats them as paths on your own server's filesystem and fails to open them (hence "No such file or directory"). You could refer to this question to figure out how to get absolute URLs from relative ones.

answered May 28 '12 at 22:41

grossvogel

score 0 · Answer 2 · answered May 28 '12 at 22:44

The approach you tried with image size checking is correct.

However, in order for it to work on all sites, you would need to add some kind of relative URL parsing.

I don't know if there are any libraries or such for it but here's a quick overview on how to do it:

1. Find the scheme and domain part of the URL you're scraping (for example http://www.example.com).
2. Assume any URL starting with / is relative to the site root. You can fetch these simply by concatenating the scheme and domain with the path.
3. Assume any other URL without a scheme is relative to the directory of the current page. You may need to resolve any .. markers in the URL to locate the expected path.
4. Check for the <base> tag in the document: if the document has a <base> tag, all relative paths are resolved against the URL defined in that tag instead of the page URL.

You may be able to find a library to convert relative paths and absolute paths into something you can use, but in most cases they will not account for the <base> tag mentioned in the last point.
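The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: resolve_url() is a hypothetical helper name, the optional $baseHref argument is assumed to be an absolute URL taken from the document's <base> tag, and edge cases of full URL resolution are deliberately not covered.

```php
<?php
/**
 * Hypothetical helper: resolve an <img> src against the page URL,
 * or against the <base href> if one is given (assumed to be absolute).
 */
function resolve_url(string $src, string $pageUrl, ?string $baseHref = null): string
{
    // Anything that already has a scheme (http:, https:, data:) is left alone.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $src)) {
        return $src;
    }

    $base = parse_url($baseHref ?? $pageUrl);
    if ($base === false || !isset($base['scheme'], $base['host'])) {
        return $src; // no usable base, nothing sensible to do
    }

    // Protocol-relative URL such as //cdn.example.com/x.png
    if (strpos($src, '//') === 0) {
        return $base['scheme'] . ':' . $src;
    }

    $origin = $base['scheme'] . '://' . $base['host']
            . (isset($base['port']) ? ':' . $base['port'] : '');

    // Set the query string or fragment aside so it is not normalized.
    $cut    = strcspn($src, '?#');
    $path   = (string) substr($src, 0, $cut);
    $suffix = (string) substr($src, $cut);

    $basePath = $base['path'] ?? '/';
    if ($path === '') {
        $path = $basePath;
    } elseif ($path[0] !== '/') {
        // Relative to the directory of the base URL.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Resolve "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }
    $normalized = '/' . implode('/', $out);
    if ($out && substr($path, -1) === '/') {
        $normalized .= '/'; // keep a trailing slash
    }

    return $origin . $normalized . $suffix;
}
```

With SimpleHTMLDOM, the <base> href could presumably be read with something like $html->find('base', 0) and passed as the third argument; treat that wiring as an assumption to check against your version of the library.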

score 0 · Answer 3 · answered May 28 '12 at 23:21

Try something like this, assuming a URL of http://somedomain.com...

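// Build "scheme://host" from the page URL, e.g. "http://somedomain.com"
$parts = explode('/', $url);
$domain = $parts[0] . '//' . $parts[2];

// ... snip ...

if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
    // Prefix root-relative paths with the domain BEFORE fetching the image
    if (strpos($src, '/') === 0) {
        $src = $domain . $src;
    }

    // getimagesize() returns false if the image cannot be read
    $size = @getimagesize($src);
    if ($size !== false && $size[0] > 200) {
        $images[] = $src;
    }
}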

This will help some, but it won't be fool-proof. I can't think of many domains using ../../etc relative paths to images, but I'm sure someone is. Of course, you could test whether the image's src attribute contains anything other than the domain and try prepending the domain, but no promises that will work every time either. I would think there's a better way... perhaps have a default method and load a config with predefined domain "fixes" for troublesome domains.
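Once the URLs are absolute, the size check itself can be kept from breaking the script by testing the return value of getimagesize() rather than letting its warnings surface. A minimal sketch, assuming a hypothetical filter_large_images() helper; the $probe parameter is not part of any library, it is only there so the function can be tried without network access:

```php
<?php
/**
 * Hypothetical helper: keep only the URLs whose probed width exceeds
 * $minWidth, widest first. $probe is any callable returning
 * [width, height, ...] or false; by default it wraps getimagesize().
 */
function filter_large_images(array $urls, int $minWidth = 200, ?callable $probe = null): array
{
    $probe = $probe ?? function (string $url) {
        // @ silences the "failed to open stream" warning; failure is
        // reported through the false return value instead.
        return @getimagesize($url);
    };

    $widths = [];
    foreach (array_unique($urls) as $url) {
        $size = $probe($url);
        if ($size !== false && $size[0] > $minWidth) {
            $widths[$url] = $size[0];
        }
    }

    arsort($widths); // sort by width, descending, keeping the URL keys
    return array_keys($widths);
}
```

Calling filter_large_images($images) with the default probe downloads every image to measure it, so on pages with many images it may be slow; that trade-off is the same as in the original loop.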
