If you are just trying to get the content length of a file by URL, you can do so by requesting only the HTTP headers (an HTTP HEAD request) and checking the Content-Length field:
import requests
url = 'https://commons.wikimedia.org/wiki/File:Leptocorisa_chinensis_(20566589316).jpg'
# A HEAD request returns only the response headers, never the body.
# requests.head() doesn't follow redirects by default, so enable that explicitly.
http_response = requests.head(url, allow_redirects=True)
print(f"Size of image {url} = {http_response.headers['Content-Length']} bytes")
However, if the server compresses the response before sending it, the Content-Length field will contain the compressed size (the amount of data that will actually be transferred) rather than the uncompressed file size.
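You can get a rough indication of whether that is happening by also checking the Content-Encoding header; if it is set (e.g. to gzip or br), the reported length is the compressed transfer size. A small sketch of that check:
import requests
url = 'https://commons.wikimedia.org/wiki/File:Leptocorisa_chinensis_(20566589316).jpg'
response = requests.head(url, allow_redirects=True)
encoding = response.headers.get('Content-Encoding')   # e.g. 'gzip', 'br', or None
length = response.headers.get('Content-Length')       # may be absent for chunked responses
if encoding:
    print(f"{length} bytes on the wire, compressed with {encoding}")
else:
    print(f"{length} bytes, no compression indicated")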
To do this for every image on a given page, you could use the BeautifulSoup HTML parsing library to extract the URLs of the image links on the page and then check each one's size as follows:
from time import sleep
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup as Soup
url = 'https://en.wikipedia.org/wiki/Agent_Orange'
html = Soup(requests.get(url).text, 'html.parser')
# The hrefs are root-relative (e.g. /wiki/File:...), so resolve them against the page URL.
image_links = [urljoin(url, a['href']) for a in html.find_all('a', {'class': 'image'})]
for img_url in image_links:
    # Again, a HEAD request is enough to read the Content-Length header.
    response = requests.head(img_url, allow_redirects=True)
    try:
        print(f"Size of image {img_url} = {response.headers['Content-Length']} bytes")
    except KeyError:
        print(f"Server didn't specify content length in headers for {img_url}")
    sleep(0.5)  # be polite and don't hammer the server
You'll have to adjust this to your specific problem, and you might have to pass other parameters to html.find_all() to narrow it down to the specific images you're interested in, but something along these lines will achieve what you're trying to do.
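For example, one way to narrow the results (purely an illustration; the 'image' class and the extension filter are assumptions about the page you're scraping) is to keep only the anchors whose href looks like a raster image:
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup as Soup
url = 'https://en.wikipedia.org/wiki/Agent_Orange'
html = Soup(requests.get(url).text, 'html.parser')
# href=True skips anchors without an href; the extension filter is just an illustrative narrowing step.
wanted = ('.jpg', '.jpeg', '.png', '.gif')
image_links = [urljoin(url, a['href'])
               for a in html.find_all('a', {'class': 'image'}, href=True)
               if a['href'].lower().endswith(wanted)]
print(image_links)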