I am using SimpleHTMLDOM to scrape pages (on servers other than mine).

The basic implementation is

try {
    $html = file_get_html(urldecode(trim($url)));
} catch (Exception $e) {
    echo $url;
}

foreach ($html->find('img') as $element) {
    $src = $element->src;
    if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
        $images[] = $src;
    }
}

This works fine but it returns all images from the page, including small avatars, icons, and button images. Of course I'd like to avoid these.

I then tried adding a size check inside the loop, as follows:

...

if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
    $size = getimagesize($src);
    if ($size[0] > 200) {
        $images[] = $src;
    }
}
...

That works well on a page like http://cnn.com. But in others it returns numerous errors. For example

http://www.huffingtonpost.com/2012/05/27/alan-simpson-republicans_n_1549604.html

gives a bunch of errors like

<p>Severity: Warning</p>
<p>Message:  getimagesize(/images/snn-logo-comments.png): failed to open stream: No such file or directory
<p>Severity: Warning</p>
<p>Message:  getimagesize(/images/close-gray.png): failed to open stream: No such file or directory

which seem to be happening because some images use relative URLs. The problem here is that this crashes the script, so no images are loaded and my Ajax box keeps loading forever.

Do you have any ideas how to troubleshoot this?

pepe
  • Whoever downvoted please justify. – pepe May 28 '12 at 22:54
  • Scraping causes a lot of alarm bells to go off around here. It can help to put in some context about your legitimate reasons for doing this, to let people know they're not aiding and abetting content thievery. – grossvogel May 28 '12 at 22:57
  • and you have permission from the site owners to do this ? –  May 28 '12 at 23:06
  • Are you serious? Have you ever heard of Pinterest or Facebook? – pepe May 28 '12 at 23:08
  • ever heard of copyright? –  May 28 '12 at 23:09
  • I'm also a little "Oh, this" about this question, but I don't think we should judge without any facts. Scraping is __kind__ of sleazy looking, but then again who knows what it's for. Also, is it that difficult for the admins of these sites to see him scraping and block the IP? Of course, ways around that...but if you're scraping and getting cut off and then finding a workaround, well, then you're an a$$. – phatskat May 28 '12 at 23:24
  • Oh my @dagon, I'm not having this discussion. Points made by grossvogel are valid. – pepe May 28 '12 at 23:26

3 Answers

1

The problem is that the image URLs are relative to the site root, so your server can't make sense of them to fetch them and find out their size. You could refer to this question to figure out how to get absolute URLs from relative ones.

grossvogel
0

The approach you tried with image size checking is correct.

However, in order for it to work on all sites, you would need to add some kind of relative URL parsing.

I don't know if there are any libraries or such for it but here's a quick overview on how to do it:

  • Find the domain part of the URL you're scraping
  • Assume any URL starting with / is a path relative to the site root. You can fetch these simply by concatenating the scheme and domain with the path
  • Assume any URL not starting with / is relative. You may need to parse any .. markers in the URL to locate the expected path
  • Check for the <base> tag in the document: If the document has a <base> tag, it will anchor all relative paths into the path defined in the tag.

You may be able to find a library to convert relative paths and absolute paths into something you can use, but in most cases they will not account for the <base> tag mentioned in the last point.
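As a rough sketch of the steps above, something like the following could work. The function name resolve_url and its $baseHref parameter are made up for illustration, and it assumes that any <base> href is itself an absolute URL. It is not a complete URL resolver: query strings and fragments get no special handling.

```php
<?php
// Hypothetical helper: resolve an <img> src against the page it came from.
// $baseHref is the href of the page's <base> tag if it has one, else null.
function resolve_url(string $src, string $pageUrl, ?string $baseHref = null): string
{
    // Already absolute: nothing to do.
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }

    // A <base> tag, when present, replaces the page URL as the anchor.
    $base   = parse_url($baseHref ?? $pageUrl);
    $scheme = $base['scheme'] ?? 'http';
    $root   = $scheme . '://' . ($base['host'] ?? '');
    if (isset($base['port'])) {
        $root .= ':' . $base['port'];
    }

    // Protocol-relative URL such as //cdn.example.com/x.png
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;
    }

    if (strpos($src, '/') === 0) {
        // Relative to the site root.
        $path = $src;
    } else {
        // Relative to the directory of the anchor document.
        $basePath = $base['path'] ?? '/';
        $dir  = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path = $dir . $src;
    }

    // Collapse "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }

    return $root . '/' . implode('/', $out);
}
```

You would call it on each src before passing it to getimagesize(), for example resolve_url($element->src, $url), and pass the <base> href as the third argument if the page has one.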

Jani Hartikainen
0

Try something like this, assuming a URL of http://somedomain.com...

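// Build the scheme + host, e.g. "http://somedomain.com"
$parts = explode('/', $url);
$domain = $parts[0] . '//' . $parts[2];

// ... snip ...

if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
    // Resolve root-relative paths before trying to fetch the image
    if (strpos($src, '/') === 0) {
        $src = $domain . $src;
    }

    // getimagesize() returns false when the image can't be read
    $size = getimagesize($src);
    if ($size !== false && $size[0] > 200) {
        $images[] = $src;
    }
}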

This will help some, but it won't be foolproof. I can't think of many domains using ../../etc relative paths to images, but I'm sure someone is. You could also check whether the image's src attribute contains the domain at all and, if not, try prepending it, but no promises that will work every time either. I would think there's a better way... perhaps have a default method and load a config with predefined domain "fixes" for troublesome domains.
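To make the config idea a bit more concrete, here is a minimal sketch. Everything in it is hypothetical: the $imageBaseFixes array, the static.somedomain.com entry, and the image_base and fix_src function names are just placeholders to show the shape of it.

```php
<?php
// Hypothetical config: page host => base URL to use for that site's images.
$imageBaseFixes = [
    'somedomain.com' => 'http://static.somedomain.com',
];

function image_base(string $pageUrl, array $fixes): string
{
    $parts = parse_url($pageUrl);
    $host  = $parts['host'] ?? '';

    // A predefined fix wins over the default.
    if (isset($fixes[$host])) {
        return $fixes[$host];
    }

    // Default method: scheme + host of the page being scraped.
    return ($parts['scheme'] ?? 'http') . '://' . $host;
}

function fix_src(string $src, string $pageUrl, array $fixes): string
{
    // Only root-relative paths are rewritten; everything else is left alone.
    if (strpos($src, '/') === 0 && strpos($src, '//') !== 0) {
        return image_base($pageUrl, $fixes) . $src;
    }
    return $src;
}
```

The default covers most sites, and you only add an entry to the array when a particular domain turns out to serve its images from somewhere else.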

phatskat