I am building a Python script to download images from a list of URLs. The script works to an extent, but I don't want it to download images from URLs that no longer exist. Checking the status code filters out a few of them, but I still get many bad images that I don't want, like these:

[image: bad image] [image: no description]

Here is my code:

import os
import requests
import shutil
import random
import urllib.request

def sendRequest(url):
    try:
        page = requests.get(url, stream = True, timeout = 1)

    except Exception:
        print('error exception')
        pass

    else:
        #HERE IS WHERE I DO THE STATUS CODE
        print(page.status_code)
        if (page.status_code == 200):
            return page

    return False

def downloadImage(imageUrl: str, filePath: str):
    img = sendRequest(imageUrl)

    if (img == False):
        return False

    with open(filePath, "wb") as f:
        img.raw.decode_content = True

        try:
            shutil.copyfileobj(img.raw, f)
        except Exception:
            return False

    return True

os.chdir('/Users/nikolasioannou/Desktop')
os.mkdir('folder')

fileURL = 'http://www.image-net.org/api/text/imagenet.synset.geturls?wnid=n04122825'
data = urllib.request.urlopen(fileURL)

output_directory = '/Users/nikolasioannou/Desktop/folder'

line_count = 0

for line in data:
    img_name = str(random.randrange(0, 10000)) + '.jpg'
    image_path = os.path.join(output_directory, img_name)
    # strip the trailing newline that each line of the URL list carries
    downloadImage(line.decode('utf-8').strip(), image_path)
    line_count = line_count + 1
#print(line_count)

Thanks for your time. Any ideas are appreciated.

Sincerely, Nikolas

  • You could check for the JPEG or PNG header and magic sequence. – juliusmh Aug 09 '18 at 00:27
  • Thanks for the quick response! Sorry, I am fairly new to Python, how could I do this? @juliusmh –  Aug 09 '18 at 00:29
  • Possible duplicate of [How to check if a file is a valid image file?](https://stackoverflow.com/questions/889333/how-to-check-if-a-file-is-a-valid-image-file) – juliusmh Aug 09 '18 at 00:29
  • Is the problem that you're getting non-images like an HTML page, or that you're getting useless placeholder images? – Kevin J. Chase Aug 09 '18 at 00:55

1 Answer

You could check for the JPEG or PNG header and its magic sequence, which is a pretty good indicator of a valid image. Look at this SO question.

You can take a look at file signatures (aka magic numbers) here. You then just have to check the first n bytes of response.raw.
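
As a rough sketch of what such a signature check might look like for more than one format: the names `IMAGE_SIGNATURES` and `detect_image_type` below are made up for illustration and are not part of requests or any other library. The byte sequences are the standard leading bytes for JPEG, PNG and GIF files.

```python
from typing import Optional

# Leading byte sequences ("magic numbers") of some common image formats.
# IMAGE_SIGNATURES and detect_image_type are illustrative names only.
IMAGE_SIGNATURES = {
    "jpeg": (b"\xff\xd8\xff",),          # fourth byte varies (E0 = JFIF, E1 = Exif)
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def detect_image_type(data: bytes) -> Optional[str]:
    """Return the format name if data starts with a known signature, else None."""
    for name, signatures in IMAGE_SIGNATURES.items():
        # bytes.startswith accepts a tuple of candidate prefixes
        if data.startswith(signatures):
            return name
    return None


print(detect_image_type(b"\xff\xd8\xff\xe0\x00\x10JFIF"))         # jpeg
print(detect_image_type(b"<html><body>Not found</body></html>"))  # None
```

With something like this, the downloaded bytes (for example `response.content`) could be passed to `detect_image_type` and the file skipped whenever the result is `None`.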

I modified your sendRequest and downloadImage functions a little bit. You should be able to hardcode more valid image file extensions than just JPG by adding their magic numbers. I tested the code and it is working (on my machine): only valid JPG images were saved. Note that I removed the stream=True flag because the images are so small that you don't need a stream, and the saving gets a little less cryptic. Take a look:

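def sendRequest(url):
    try:
        # no stream=True: the images are small enough to load into memory;
        # timeout kept from the question so a dead host cannot hang the loop
        page = requests.get(url, timeout=1)

    except Exception as e:
        print("error:", e)
        return False

    # check status code
    if page.status_code != 200:
        return False

    return page

def downloadImage(imageUrl: str, filePath: str):
    img = sendRequest(imageUrl)

    if img is False:
        return False

    # every JPEG starts with FF D8 FF; the fourth byte varies
    # (E0 for JFIF, E1 for Exif), so only the first three bytes are compared
    if not img.content.startswith(b'\xff\xd8\xff'):
        return False

    with open(filePath, "wb") as f:
        f.write(img.content)

    return True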

You could also try to open the image using Pillow and BytesIO:

>>> from PIL import Image
>>> from io import BytesIO

>>> i = Image.open(BytesIO(img.content))
>>> i.verify()  # raises an exception if the file is broken

and see if it throws an error. But the first solution seems more lightweight, and you should not get any false positives there. You could also check whether img.content contains b"<html>" and abort if it does. This is very easy and probably very effective too.
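
A minimal sketch of the HTML check mentioned above, assuming it is enough to look at the start of the response body. The function name `looks_like_html` and the `probe_size` parameter are hypothetical, and this is only a heuristic: it lowercases the bytes so that `<HTML>` and tags with attributes such as `<html lang="en">` are also caught.

```python
def looks_like_html(data: bytes, probe_size: int = 1024) -> bool:
    """Heuristic: True if the first probe_size bytes contain an HTML marker."""
    # bytes.lower() only affects ASCII letters, which is all we need here
    head = data[:probe_size].lower()
    return b"<html" in head or b"<!doctype html" in head


print(looks_like_html(b"<!DOCTYPE html><HTML><body>404</body></HTML>"))  # True
print(looks_like_html(b"\xff\xd8\xff\xe0\x00\x10JFIF"))                  # False
```

In downloadImage this could be called on `img.content` before writing the file, returning False when it reports HTML.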

juliusmh
  • How do I check for a header and its respective magic sequence? I looked at the linked question and didn't understand much. I appreciate your help. –  Aug 09 '18 at 00:35
  • I guess where I'm confused is: what do the file signatures do? What are they going to tell me about an image file, and how do I know which file signature to look for in an image with a false URL? –  Aug 09 '18 at 00:42
  • Basically, you have no idea what the server responds with, or whether the URL still exists, and so on. Files like JPEG or PNG images start with a constant, defined series of bytes so that an application can detect the file type without relying on the extension. Your problem is not about the URL. You have a bunch of bytes that you just downloaded, and you want to check whether this bunch of bytes is an image. I updated my answer with the signature check. – juliusmh Aug 09 '18 at 00:45
  • Oh, I see, @juliusmh. Thanks for the explanation. I'll check out the answer. –  Aug 09 '18 at 00:49
  • Glad it helped; just comment if you need additional help. This sounds like a common problem, so maybe there is a library for it, or you could bundle those signatures with a check function into a small library, which would be good practice too! There is one more way to check if the bytes are a valid image: you could try to load the image using Pillow and BytesIO. – juliusmh Aug 09 '18 at 00:51
  • I'm sorry to sound so helpless, but I tried the code above by just replacing my old download function, and every downloaded image gave the error "The file “name.jpg” could not be opened because it is empty." upon opening it. Also, I am not sure what you were referring to above when you said "hardcode more valid image file extensions". –  Aug 09 '18 at 00:55
  • Ah god, I made a mistake. The updated solution should work now, hopefully :) – juliusmh Aug 09 '18 at 01:06
  • Thanks, I'll give it a try. –  Aug 09 '18 at 05:03