Parse external HTML and return images

Question

I'm building a site that depends on bookmarklets. These bookmarklets pull the URL and a couple of other elements. However, I need to select 1 image from the page the user bookmarks. Currently I'm trying to use the PHP Simple HTML DOM Parser http://simplehtmldom.sourceforge.net/

It pulls the HTML as expected, and returns the tags as expected. However, I want to take this a step further and only return images with a min width of 40px. I know about the function getimagesize() but from what I understand, this is resource heavy. Is there a better method available to pre-process the image and achieve the results I'm looking for?

Thanks!

Obviously, `getimagesize` has to download the images if they're remote. Other than that, I don't know of any performance issues. Where did you read that? — Matthew Flaschen, Oct 23 '11 at 01:39
You could first check if the img tag has a width set, and go with that before resorting to getimagesize. Also, header information contains the size.. you could disregard any image larger than a certain size. Even though you don't know the dimensions, you can assume a 100kb image isn't 40x800 — Thilo Savage, Oct 23 '11 at 01:58
@matthew, downloading all the images is the performance issue. I don't want to waste the bandwidth if I don't have to. — Paul Dessert, Oct 23 '11 at 02:44
@thilo, do you have an example of retrieving the size from the header info? You're talking about file size, correct? Thanks. — Paul Dessert, Oct 23 '11 at 02:46
@Thilo, that doesn't tell you the dimensions of the image, or even the proportions. You can use the img tag to scale it to literally whatever you want. It would only be useful if a particular site had the habit of putting the actual dimensions in the tag. — Matthew Flaschen, Oct 23 '11 at 02:54
@Paul you can get the file size from the Content-Length header. Also check out this other answer; it's in C#, but it might be helpful. http://stackoverflow.com/questions/111345/getting-image-dimensions-without-reading-the-entire-file — Marshall Brekka, Oct 23 '11 at 03:50
Have you considered a service like http://embed.ly/ ? Haven't tried it myself, but seemed useful. — name, Oct 23 '11 at 06:56

score 0 · Accepted Answer · answered Oct 23 '11 at 04:21

First check if the image HTML tag has a width attribute. If it's above 40, skip over it. As Matthew mentioned, it will get false positives where people sized down a large image to 40px wide, but that's no big deal; the point of this step is to quickly weed out the first dozen or so images that are obviously too big.
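
A minimal sketch of this first pass, assuming PHP's bundled DOM extension (DOMDocument) instead of PHP Simple HTML DOM Parser. The function name declared_image_widths is made up for illustration. It only reports what each tag claims, so the caller decides how to compare the result against the 40px mark.

```php
<?php
// Sketch: map each <img> src to its declared width attribute (or null).
// Assumes the bundled DOM extension, not PHP Simple HTML DOM Parser.
function declared_image_widths(string $html): array
{
    $doc = new DOMDocument();
    // Scraped markup is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $widths = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src === '') {
            continue; // no URL, so nothing to check later
        }
        $raw = trim($img->getAttribute('width'));
        // Only plain integers count ("40"); "50%" or a missing attribute becomes null.
        $widths[$src] = preg_match('/^\d+$/', $raw) ? (int) $raw : null;
    }
    return $widths;
}

// Usage: images with a null width have no usable attribute and need a real check.
$html = '<p><img src="a.png" width="32"><img src="b.jpg" width="640"><img src="c.gif"></p>';
var_dump(declared_image_widths($html));
```

Images that come back as null carry no usable width attribute, so they have to go through the slower checks.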

Once the script catches an image that SAYS it's under 40px wide, check the HTTP response headers (Content-Length) to get a rough idea of the width from the size of the file. This is faster than getimagesize() because you don't have to download the image to get the info.

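// Returns the size in bytes reported by Content-Length, or null if it is unavailable.
function get_image_bytes($url) {
    // Request headers only (HEAD); get_headers() sends a GET by default.
    // The third argument requires PHP 7.1+.
    $context = stream_context_create(array('http' => array('method' => 'HEAD')));
    $headers = @get_headers($url, true, $context);
    if ($headers === false) {
        return null;
    }
    // Header names are case-insensitive, so normalise them before the lookup.
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (!isset($headers['content-length'])) {
        return null;
    }
    // After a redirect the header can appear more than once; use the last value.
    $len = $headers['content-length'];
    return (int) (is_array($len) ? end($len) : $len);
}


// get_headers() needs a full URL; this one is a placeholder.
$imageBytes = get_image_bytes('http://example.com/test1.jpg');
// Rough guess: a 40x80 image is about 2000 bytes (2 KB)
$cutoffSize = 2000;
if ($imageBytes !== null && $imageBytes < $cutoffSize) {
    // this is the one!
}
else {
    // it was a phoney, keep scraping
}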

Setting the cutoff at 2000 bytes (Content-Length is reported in bytes) will also let through images that are 100x30, which isn't good.

However, at this point, you've weeded out most of the huge 800 KB files that would really slow you down, and because we know this one is under 2 KB, it's not too taxing to test it with getimagesize() to get an accurate width.
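
A rough sketch of that final check. The helper names image_width_from_bytes and remote_image_width are hypothetical. The measuring is split from the download so it can be tried without a network connection, and it uses getimagesizefromstring(), the in-memory counterpart of getimagesize(). Calling getimagesize() on the URL directly would work as well, assuming allow_url_fopen is enabled.

```php
<?php
// Sketch: real pixel width of an image held in memory, or null if it can't be read.
function image_width_from_bytes(string $bytes): ?int
{
    $info = @getimagesizefromstring($bytes);
    return $info === false ? null : $info[0]; // index 0 is the width in pixels
}

// Hypothetical helper: download a small candidate and measure it.
// Assumes allow_url_fopen is enabled.
function remote_image_width(string $url): ?int
{
    $bytes = @file_get_contents($url);
    return $bytes === false ? null : image_width_from_bytes($bytes);
}

// Offline demo: a bare GIF header that declares a 120x60 canvas.
$gif = 'GIF89a' . pack('vv', 120, 60) . "\x00\x00\x00";
var_dump(image_width_from_bytes($gif));
```

Only candidates that passed the Content-Length cutoff should reach remote_image_width(), so each download stays small.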

You can tweak the process depending on how picky you are about the 40px mark. As usual, higher accuracy takes more time, and vice versa.

thanks! This helped speed things up a bit. Now I just need to figure out how to speed up PHP Simple HTML DOM Parser :) — Paul Dessert, Oct 23 '11 at 07:30
Depending on what you need to scrape, maybe a regex would be faster than parsing the DOM — Thilo Savage, Oct 23 '11 at 13:11
