Given an HTML page with a news article, I'm trying to detect the relevant image(s) from the article. To do this, I'm looking at the sizes of the images (if they're too small, they're likely navigational elements), but I don't want to download every image.

Is there a way to get the width and height of an image without downloading the full image?

mirceapasoi
  • You should consider looking at the img tags like @gor suggests. You can bet that they use the same template to post each news story, so you can probably pull it by a div or img id / class. – Brian D Feb 13 '11 at 11:17
  • Scraping the web to create one giant newsfeed, huh. That's a cool idea :) (Found your http://summify.com/) – Brian D Feb 13 '11 at 11:19
  • "No" will be the immediate answer, and @Brian D's suggestion is an option, but don't forget that CSS might come into play... An alternative to downloading the whole image is to do a HEAD request for the image URL. That will probably return a Content-Length, which gives you the "size" of the image (although not its height/width). Far-fetched idea: you might even take it one step further. If you know the content type and you know that you only need the first 128 bytes to determine the actual width and height, you can stop pulling bytes from the server after the first 128 bytes... – rene Feb 13 '11 at 11:29
  • @asgerhallas those 128 bytes were an example... If you know the format of the file AND the height/width is in the first 128 bytes (or 256, or 1024, or the first 4 bytes), you only need to fetch those bytes. If the file format stores the size info in the last four bytes, you have no other option than to process the whole file. Is that in line with your answer? – rene Feb 13 '11 at 13:22
  • Ah ok, I get it :) ... we're in line, that's what my sample does. It returns the size info as soon as it gets it from the response stream. For JPEGs you won't know the exact position of the header. They can have nested thumbnails with their own size info that needs to be skipped first, so the size info can be pretty late in the file. – asgerhallas Feb 13 '11 at 14:17
  • @rene that's exactly what I'm looking to do. I wonder if there's a Python library that I can use instead of doing it manually for every format. – mirceapasoi Feb 13 '11 at 21:59
  • @mirceapasoi check my edit to my answer regarding Python... – asgerhallas Feb 13 '11 at 22:52
  • @asgerhallas Thanks a lot! @BrianD Glad you like Summify! – mirceapasoi Feb 13 '11 at 23:16
  • Actually, the 'process the whole file' isn't strictly true, as it happens. If you actually know where the information is, you can get it with a careful HTTP request with Range headers, where you can get any arbitrary byte range within the file (even just the last 4 bytes if you prefer). JPEGs are tricky, though, for all the reasons given above; I'm looking at grabbing 8KB chunks of the file and seeing how often the header will be in there. – Arantor Oct 19 '11 at 16:13
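
The Range-header idea from the comments above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `build_range_request`, `total_size_from_content_range` and `fetch_prefix` are hypothetical helper names, the 8 KB default simply mirrors the chunk size mentioned above, and a server is free to ignore the Range header and send the whole file.

```python
import urllib.request


def build_range_request(url, first_bytes=8192):
    """Build a GET request that asks only for the first `first_bytes` bytes."""
    request = urllib.request.Request(url)
    # Byte ranges are inclusive, so "bytes=0-8191" asks for 8192 bytes.
    request.add_header("Range", f"bytes=0-{first_bytes - 1}")
    return request


def total_size_from_content_range(value):
    """Parse the total size out of e.g. 'bytes 0-8191/123456'; None if unknown."""
    if not value or "/" not in value:
        return None
    total = value.rsplit("/", 1)[1].strip()
    return int(total) if total.isdigit() else None


def fetch_prefix(url, first_bytes=8192, timeout=10):
    """Return (prefix_bytes, total_size_or_None) for the resource at `url`."""
    request = build_range_request(url, first_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # A server that ignores Range answers 200 with the full body,
        # so cap the read instead of trusting the response length.
        data = response.read(first_bytes)
        total = total_size_from_content_range(response.headers.get("Content-Range"))
        return data, total
```

The returned prefix can then be handed to a format-specific header parser. For JPEGs the prefix may not reach the size information, so a second request for a later range may be needed.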

2 Answers

I don't know if it'll help you speed up your application, but it can be done. Check out these two articles:

http://www.anttikupila.com/flash/getting-jpg-dimensions-with-as3-without-loading-the-entire-file/ for JPEG

http://www.herrodius.com/blog/265 for PNG

They are both for ActionScript, but the principle applies to other languages as well, of course.

I made a sample using C#. It's not the prettiest code and it only works for JPEGs, but it can easily be extended to PNG too:

using System;
using System.IO;
using System.Net;

var request = (HttpWebRequest) WebRequest.Create("http://unawe.org/joomla/images/materials/posters/galaxy/galaxy_poster2_very_large.jpg");
using (WebResponse response = request.GetResponse())
using (Stream responseStream = response.GetResponseStream())
{
    int r;
    bool found = false;
    while (!found && (r = responseStream.ReadByte()) != -1)
    {
        // Every JPEG marker starts with 0xFF (255)
        if (r != 255) continue;

        int marker = responseStream.ReadByte();

        // APP0-APP15 (0xE0-0xEF): application-specific segments, which may
        // contain embedded thumbnails with their own size info, so skip them whole
        if (marker >= 224 && marker <= 239)
        {
            int payloadLengthHi = responseStream.ReadByte();
            int payloadLengthLo = responseStream.ReadByte();
            // The 16-bit big-endian length includes its own two bytes
            int payloadLength = (payloadLengthHi << 8) + payloadLengthLo;
            for (int i = 0; i < payloadLength - 2; i++)
                responseStream.ReadByte();
        }
        // SOF0 (0xC0): baseline frame header
        else if (marker == 192)
        {
            // Length of payload - don't care
            responseStream.ReadByte();
            responseStream.ReadByte();

            // Bit depth (sample precision) - don't care
            responseStream.ReadByte();

            // The frame header stores the height before the width
            int heightHi = responseStream.ReadByte();
            int heightLo = responseStream.ReadByte();
            int height = (heightHi << 8) + heightLo;

            int widthHi = responseStream.ReadByte();
            int widthLo = responseStream.ReadByte();
            int width = (widthHi << 8) + widthLo;

            Console.WriteLine(width + "x" + height);
            found = true;
        }
    }
}
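
For the PNG case mentioned above, a minimal sketch in Python might look like this. It relies on the PNG layout (an 8-byte signature followed by the IHDR chunk, which stores width and then height as 4-byte big-endian integers), so the first 24 bytes of the file are enough. The function name `png_size` is hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(header):
    """Return (width, height) from the first 24 bytes of a PNG, or None."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE:
        return None
    # Bytes 8-11 hold the chunk length and bytes 12-15 the chunk type;
    # IHDR has to be the first chunk of a valid PNG.
    if header[12:16] != b"IHDR":
        return None
    # Width and height follow as two unsigned 32-bit big-endian integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height
```

Since the size sits at a fixed offset, a single small ranged read is sufficient for PNGs, unlike JPEGs.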

EDIT: I'm no Python expert, but this article seems to describe a Python lib doing just that (last sample): http://effbot.org/zone/pil-image-size.htm

asgerhallas
  • It should be noted that this sample does not abort the download; it just writes the result out as soon as it has it and continues. But it is trivial to abort it: for .NET it just requires doing the request async. Another note is that for progressive JPEGs you need to check for SOF2 too. – asgerhallas Feb 13 '11 at 22:34
  • Thanks for the answer and for the Python related link, that's exactly what I've been looking for! – mirceapasoi Feb 13 '11 at 23:15
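
A standard-library Python sketch of the same marker walk, folding in the two notes above (stop reading as soon as the size is known, and accept SOF2 for progressive JPEGs), might look like the following. The names `jpeg_size` and `_read_exact` are hypothetical, only SOF0 and SOF2 are recognised, and it skips every segment by its length field rather than only the APP segments.

```python
import io
import struct

# Frame headers that carry the size: SOF0 (baseline) and SOF2 (progressive).
SOF_MARKERS = {0xC0, 0xC2}


def _read_exact(stream, count):
    """Read exactly `count` bytes, or return None if the stream ends first."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            return None
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def jpeg_size(stream):
    """Walk JPEG segments in a binary stream; return (width, height) or None."""
    if _read_exact(stream, 2) != b"\xff\xd8":  # SOI marker
        return None
    while True:
        byte = _read_exact(stream, 1)
        if byte is None:
            return None
        if byte != b"\xff":
            continue  # resynchronise on the next marker prefix
        marker = _read_exact(stream, 1)
        while marker == b"\xff":  # fill bytes may precede a marker
            marker = _read_exact(stream, 1)
        if marker is None:
            return None
        code = marker[0]
        if code == 0x01 or 0xD0 <= code <= 0xD7:
            continue  # standalone markers have no length field
        if code in (0xD9, 0xDA):
            return None  # scan data or end of image reached without a frame header
        length_bytes = _read_exact(stream, 2)
        if length_bytes is None:
            return None
        (length,) = struct.unpack(">H", length_bytes)
        if code in SOF_MARKERS:
            frame = _read_exact(stream, 5)
            if frame is None:
                return None
            # Sample precision, then height, then width (all big-endian).
            _precision, height, width = struct.unpack(">BHH", frame)
            return width, height
        # Skip the rest of the segment; the length includes its own two bytes.
        if _read_exact(stream, max(length - 2, 0)) is None:
            return None
```

Any object with a binary `read` method should work, for example `io.BytesIO(data)` or an open HTTP response. Closing the response after the function returns avoids pulling the rest of the image, apart from whatever has already been buffered.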

No, it is not possible. You can get the dimensions from img tags, but not for background images.

gor
  • Agreed. Most of the time will be spent downloading the large images anyway. You're not going to save much time by skimping on the small images. (Try downloading everything asynchronously.) – Natan Yellin Feb 13 '11 at 11:20
  • Yes, asynchronous downloading can speed things up a lot. But do not spawn a lot of threads; use asynchronous functions. – gor Feb 13 '11 at 11:25
  • Why isn't it possible? I don't think you can rely on the information in the img tags; you can put any values for width and height there. – mirceapasoi Feb 13 '11 at 21:46
  • It is possible. I just did it :) – asgerhallas Feb 13 '11 at 22:31