
All my links are working and I tested them in a browser, but I still get the errors below while downloading the images.

an error occurred while fetching: "http://epaperlokmat.in/eNewspaper/News/LOK/MULK/2020/04/29/20200429_1.jpeg"
an error occurred while fetching: "http://epaperlokmat.in/eNewspaper/News/LOK/MULK/2020/04/29/20200429_2.jpeg"
an error occurred while fetching: "http://epaperlokmat.in/eNewspaper/News/LOK/MULK/2020/04/29/20200429_3.jpeg"
an error occurred while fetching: "http://epaperlokmat.in/eNewspaper/News/LOK/MULK/2020/04/29/20200429_4.jpeg"

import urllib.request
from urllib.error import URLError # the docs say this is the base error you need to catch
import time
import datetime,time
from PIL import Image
start_time = time.time()
today=time.strftime("%Y%m%d")
m=today=time.strftime("%m")
d=today=time.strftime("%d")
Y=today=time.strftime("%Y")
A=today=time.strftime("%b")

for i in range(1,5):
    issue_id1=str(i)
    url = "http://epaperlokmat.in/eNewspaper/News/LOK/MULK/"+str(Y) +"/"+str(m)+"/"+str(d)+"/"+str(Y+m+d)+"_"+str(i)+".jpeg"
    try:        
        s = urllib.request.urlopen(url)
        contents = s.read()
    except URLError:
        print('an error occurred while fetching: "{}"'.format(url))
        continue
    file = open("D:/IMAGES/"+issue_id1+".jpeg", "wb")
    file.write(contents)
  • If you catch the URLError and print it, you get the following: `HTTP Error 403: Forbidden`. It's an authorization issue. The host may be blocking you because you are going too fast or do not look like an actual human user (see the snippet after these comments). – Treatybreaker Apr 29 '20 at 08:07
  • does this help [Downloading a picture via urllib and python](https://stackoverflow.com/questions/3042757/downloading-a-picture-via-urllib-and-python) – Alessandro Candeloro Apr 29 '20 at 08:11
  • Does this answer your question? [urllib2.HTTPError: HTTP Error 403: Forbidden](https://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden) – Lydia van Dyke Apr 29 '20 at 08:18
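
For reference, a minimal sketch (standard library only) that prints the underlying HTTP status instead of the generic failure message; `HTTPError` is a subclass of `URLError` and carries the status code:

import urllib.request
from urllib.error import HTTPError, URLError

url = "http://epaperlokmat.in/eNewspaper/News/LOK/MULK/2020/04/29/20200429_1.jpeg"
try:
    urllib.request.urlopen(url)
except HTTPError as e:
    # HTTPError carries the status the server answered with,
    # e.g. "403 Forbidden" for this host.
    print(e.code, e.reason)
except URLError as e:
    # Network-level failures (DNS errors, refused connections, ...) land here.
    print(e.reason)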

1 Answer


It seems that the host you are fetching the images from doesn't like the default headers shipped with urllib.

This adjusted version seems to fetch your images correctly:

import urllib.request
from urllib.error import URLError # the docs say this is the base error you need to catch
import time
import datetime,time
from PIL import Image
start_time = time.time()
today=time.strftime("%Y%m%d")
m=today=time.strftime("%m")
d=today=time.strftime("%d")
Y=today=time.strftime("%Y")
A=today=time.strftime("%b")

fetched_images = []

for i in range(1,5):
    issue_id1=str(i)
    url = "http://epaperlokmat.in/eNewspaper/News/LOK/MULK/"+str(Y) +"/"+str(m)+"/"+str(d)+"/"+str(Y+m+d)+"_"+str(i)+".jpeg"
    try:
        # First build the request, and adjust the headers to something else.
        req = urllib.request.Request(url, 
            headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
        )

        # Secondly fetch your image
        s = urllib.request.urlopen(req)
        contents = s.read()

        # Append to your image-list
        fetched_images.append(url)
    except URLError:
        print(url)
        print('an error occurred while fetching: "{}"'.format(url))
        continue
    file = open("D:/IMAGES/"+issue_id1+".jpeg", "wb")
    file.write(contents)

To clarify: first build your request with the adjusted headers, and only then open the URL by passing the `req` object to `urlopen`.
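
If you would rather not build a `Request` object for every URL, urllib also lets you install a global opener whose headers apply to every subsequent `urlopen` call. A sketch using the same User-Agent string as above:

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')]
urllib.request.install_opener(opener)

# From here on, plain urlopen(url) calls send the custom User-Agent as well.
url = "http://epaperlokmat.in/eNewspaper/News/LOK/MULK/2020/04/29/20200429_1.jpeg"
contents = urllib.request.urlopen(url).read()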

Another way to go about this is to use requests, sending the same User-Agent header. Before this will run you will need to install the requests package: pip install requests

import requests
import datetime,time

start_time = time.time()
today=time.strftime("%Y%m%d")
month=today=time.strftime("%m")
day=today=time.strftime("%d")
year=today=time.strftime("%Y")

url = "http://epaperlokmat.in/eNewspaper/News/LOK/MULK/{year}/{month}/{day}/{year}{month}{day}_{issue_id}.jpeg"
path = "D:/IMAGES/{issue_id}.jpeg"

fetched_images = []

for issue_id in range(1, 5):
    try:
        # Let's create the url for the given issue.
        issue_url = url.format(
            year=year,
            month=month,
            day=day,
            issue_id=issue_id)

        # GET the url content
        req = requests.get(issue_url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})

        # Add the image to your list
        fetched_images.append(issue_url)

        # Save to file if successful and close the file when done.
        with open(path.format(issue_id=issue_id), 'wb') as f:
            f.write(req.content)
    except Exception as e:
        # If something went wrong, just print the url and the error.
        print('Failed to fetch {url} with error {e}'.format(
            url=issue_url, e=e))
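
One caveat: if a page number does not exist, the server answers with a 404 but requests.get does not raise, so the loop above would still write that (often empty) body to a file. A small variation of the same loop, sketched below, calls raise_for_status() before saving; it raises for 4xx/5xx responses, so missing pages end up in the except clause instead of on disk:

import time
import requests

month = time.strftime("%m")
day = time.strftime("%d")
year = time.strftime("%Y")

url = "http://epaperlokmat.in/eNewspaper/News/LOK/MULK/{year}/{month}/{day}/{year}{month}{day}_{issue_id}.jpeg"
path = "D:/IMAGES/{issue_id}.jpeg"
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'

fetched_images = []

for issue_id in range(1, 12):
    issue_url = url.format(year=year, month=month, day=day, issue_id=issue_id)
    try:
        req = requests.get(issue_url, headers={'User-Agent': user_agent})

        # Raise for 4xx/5xx answers (e.g. 404 for a page that does not exist)
        # so nothing is written to disk for missing pages.
        req.raise_for_status()

        fetched_images.append(issue_url)
        with open(path.format(issue_id=issue_id), 'wb') as f:
            f.write(req.content)
    except Exception as e:
        print('Failed to fetch {url} with error {e}'.format(url=issue_url, e=e))
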
S.D.
  • Can you also please help me out with code to build an image list of the downloaded images? Because the number of images can vary, I am not able to build the list myself. – Akshay Poklekar Apr 29 '20 at 09:11
  • See the second example, where each image is appended to `fetched_images` – S.D. Apr 29 '20 at 09:31
  • Would you accept my answer? That would be appreciated :) – S.D. Apr 29 '20 at 09:31
  • Hi, can you provide the `fetched_images` part for the first example? @S.D. – Akshay Poklekar Apr 29 '20 at 09:40
  • Here you go. But in your first example, in case a fetch fails you will be writing a previous image to your new file. You should really put the save code into the `try` clause. Now it will always run and may cause the wrong image to be saved. The second example fixes this. – S.D. Apr 29 '20 at 09:57
  • Actually, the second example doesn't download the images for me, while the first example works perfectly. That's why I need to build the dynamic image list with the first example (see the combined sketch after these comments). – Akshay Poklekar Apr 29 '20 at 10:09
  • I just noticed the path would save the images in the same dir you run the code from. I just updated the answer. Does that solve your problem? Otherwise, kindly post the error – S.D. Apr 29 '20 at 10:11
  • fetched_images.append(url) ^ SyntaxError: invalid syntax – Akshay Poklekar Apr 29 '20 at 10:33
  • In your second example, the User-Agent header is missing and that's creating the problem. In the first example, the User-Agent header you mentioned works like a charm. I would appreciate it if you put the User-Agent header into the second example as well. – Akshay Poklekar Apr 29 '20 at 11:05
  • here you go, that should be all. – S.D. Apr 29 '20 at 11:39
  • The second example downloads blank images with no format/extension and 0 byte size if the URL is not found (404 error) – Akshay Poklekar Apr 29 '20 at 11:40
  • If I change 'for issue_id in range(1, 5):' to 'for issue_id in range(1, 12):', there are no images after the 8th URL, yet it still downloads blank files for the 9th, 10th and 11th URLs. – Akshay Poklekar Apr 29 '20 at 11:51
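
For completeness, a sketch that combines the pieces discussed in these comments: the urllib approach from the first example (which works here), the save moved inside the try block, a fetched_images list, and missing pages skipped instead of written to disk. The paths and User-Agent string are the same assumptions used in the answer:

import time
import urllib.request
from urllib.error import HTTPError, URLError

year = time.strftime("%Y")
month = time.strftime("%m")
day = time.strftime("%d")
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}

fetched_images = []

for issue_id in range(1, 12):
    url = "http://epaperlokmat.in/eNewspaper/News/LOK/MULK/{y}/{m}/{d}/{y}{m}{d}_{i}.jpeg".format(
        y=year, m=month, d=day, i=issue_id)
    try:
        req = urllib.request.Request(url, headers=headers)
        contents = urllib.request.urlopen(req).read()

        # Save inside the try block, so a failed fetch never leaves a file behind.
        with open("D:/IMAGES/{}.jpeg".format(issue_id), "wb") as f:
            f.write(contents)

        fetched_images.append(url)
    except HTTPError as e:
        # 404 for pages that do not exist, 403 if the host rejects the request, ...
        print('could not fetch "{}": {} {}'.format(url, e.code, e.reason))
    except URLError as e:
        print('could not fetch "{}": {}'.format(url, e.reason))

print(fetched_images)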