Download bulk images in Python

Question

After watching a video about how to download images using Python, I typed in the code from the video. Here it is:

import pandas as pd
import urllib.request

def url_to_jpg(i, url, file_path):
    filename = 'image-{}.jpg'.format(i)
    fullpath = '{}{}'.format(file_path, filename)
    print(fullpath)
    urllib.request.urlretrieve(url, fullpath)
    print('{} saved.'.format(filename))
    return None

FILENAME = 'Images URLs.csv'
FILE_PATH = 'Images/'
urls = pd.read_csv(FILENAME)

for i, url in enumerate(urls.values):
    url_to_jpg(i, url, FILE_PATH)

When testing the code, I got the following error at the line urllib.request.urlretrieve(url, fullpath):

Images/image-0.jpg
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-d92ed57d1d8e> in <module>
     15 
     16 for i, url in enumerate(urls.values):
---> 17     url_to_jpg(i, url, FILE_PATH)

<ipython-input-36-d92ed57d1d8e> in url_to_jpg(i, url, file_path)
      6     fullpath = '{}{}'.format(file_path, filename)
      7     print(fullpath)
----> 8     urllib.request.urlretrieve(url, fullpath)
      9     print('{} saved.'.format(filename))
     10     return None

C:\ProgramData\Anaconda3\lib\urllib\request.py in urlretrieve(url, filename, reporthook, data)
    243     data file as well as the resulting HTTPMessage object.
    244     """
--> 245     url_type, path = _splittype(url)
    246 
    247     with contextlib.closing(urlopen(url, data)) as fp:

C:\ProgramData\Anaconda3\lib\urllib\parse.py in _splittype(url)
   1006         _typeprog = re.compile('([^/:]+):(.*)', re.DOTALL)
   1007 
-> 1008     match = _typeprog.match(url)
   1009     if match:
   1010         scheme, data = match.groups()

TypeError: cannot use a string pattern on a bytes-like object

Any ideas about that error?

** I have found a partial solution, which is changing the call to url_to_jpg(i, url[0], FILE_PATH)
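For context, here is a minimal sketch of why indexing with `[0]` likely helps. It assumes the CSV has a header row and a single URL column; the in-memory `csv_text` and the example.com URLs are hypothetical stand-ins for 'Images URLs.csv'. Iterating over `urls.values` yields one-element array rows rather than strings, which appears to be what `urlretrieve` rejected with the TypeError.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'Images URLs.csv': a header row and one URL column.
csv_text = "url\nhttps://example.com/a.jpg\nhttps://example.com/b.jpg\n"
urls = pd.read_csv(io.StringIO(csv_text))

# .values is a 2-D array, so each item in the loop is a one-element row
# (an array), not a string.
first_row = urls.values[0]
print(type(first_row).__name__, first_row.shape)

# Indexing the row with [0] yields the URL itself as a plain str.
first_url = first_row[0]
print(type(first_url).__name__, first_url)

# Alternative: take the first column directly, so every item is already a str.
url_list = urls.iloc[:, 0].tolist()
print(url_list)
```

With the column converted to a list first, the loop can pass `url` straight through without the `[0]` index.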

But it seems that some of the links are not allowed, as I got another error: HTTPError: HTTP Error 403: Forbidden. How can I overcome this?

** I tried to add headers (a user agent) as suggested, but I don't know how to finish it properly. How do I use urlretrieve in that case?

import urllib.request

hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

response = urllib.request.Request("http://www.gunnerkrigg.com//comics/00000001.jpg", headers=hdr)
print(urllib.request.urlopen(response))
urllib.request.urlretrieve(urllib.request.urlopen(response).read(),'oo.jpg')
#urllib.request.urlretrieve("http://www.gunnerkrigg.com//comics/00000001.jpg", "00000001.jpg")

It seems like you need to decode first. Does this answer your question? https://stackoverflow.com/questions/31019854/typeerror-cant-use-a-string-pattern-on-a-bytes-like-object-in-re-findall — Berkay, Dec 11 '20 at 08:06
Thanks a lot. The code is working now, but why do some links not work? — YasserKhalil, Dec 11 '20 at 08:10
Well, as stated, it seems like it's forbidden. You cannot access without authorization. — Berkay, Dec 11 '20 at 08:14
Do you mean I have to log in to the website to get the image? Doesn't the image have a URL that should load without authorization? Can you try this image link: https://excel-egy.com/forum/download/avatar/Doctor.jpg? — YasserKhalil, Dec 11 '20 at 08:16
When I opened the link in the IE browser, it worked and I could see the image without any authorization. — YasserKhalil, Dec 11 '20 at 08:17
Then, try adding an agent for your code. Check this: https://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden — Berkay, Dec 11 '20 at 08:19
Thanks a lot. I am not an expert at such stuff. Can you guide me on how to modify the code here? — YasserKhalil, Dec 11 '20 at 08:21

Berkay · Accepted Answer · 2020-12-11T08:40:38.070

This code should help you get past HTTPError: HTTP Error 403: Forbidden.

It is your code with a User-Agent header added.

import pandas as pd
import urllib.request

# build an opener
opener = urllib.request.build_opener()

# add a header for opener
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7')]

# install opener once
urllib.request.install_opener(opener)

def url_to_jpg(i, url, file_path):
    filename = 'image-{}.jpg'.format(i)
    fullpath = '{}{}'.format(file_path, filename)
    print(fullpath)
    urllib.request.urlretrieve(url, fullpath)
    print('{} saved.'.format(filename))
    return None

FILENAME = 'Images URLs.csv'
FILE_PATH = 'Images/'
urls = pd.read_csv(FILENAME)

for i, url in enumerate(urls.values):
    url_to_jpg(i, url[0], FILE_PATH)
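As an alternative to installing a global opener, the headers can be attached to each request. The following is only a sketch: `download_with_headers` and `DEFAULT_HEADERS` are hypothetical names, and the User-Agent value is an assumed placeholder. It relies on `urlopen` accepting a `Request` object, whereas `urlretrieve` expects a URL string.

```python
import shutil
import urllib.request

# Assumed header value; any browser-like User-Agent string could be used here.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def download_with_headers(url, fullpath, headers=None):
    """Fetch url with custom request headers and save the body to fullpath."""
    request = urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)
    # Open the Request and stream the response bytes straight to disk.
    with urllib.request.urlopen(request) as response, open(fullpath, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return fullpath
```

Inside `url_to_jpg`, the `urlretrieve(url, fullpath)` call could then be swapped for `download_with_headers(url, fullpath)`. This keeps the header local to these downloads instead of changing the behaviour of every later `urllib.request` call in the session.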

Amazing. Thank you very much. I just modified this line to work on my side `url_to_jpg(i, url[0], FILE_PATH)` — YasserKhalil, Dec 11 '20 at 08:38
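Some servers may still refuse a request even with a User-Agent header, and a single failure would otherwise stop the whole loop. Here is a minimal sketch of a loop that records failures and carries on. `download_all` and `failed` are hypothetical names, and `urls` is assumed to be a list of plain strings.

```python
import os
import urllib.error
import urllib.request


def download_all(urls, file_path):
    """Try every URL; return (index, url, error) tuples for the ones that failed."""
    os.makedirs(file_path, exist_ok=True)  # make sure the target folder exists
    failed = []
    for i, url in enumerate(urls):
        fullpath = os.path.join(file_path, 'image-{}.jpg'.format(i))
        try:
            urllib.request.urlretrieve(url, fullpath)
        except urllib.error.URLError as err:
            # HTTPError (e.g. 403 Forbidden) is a subclass of URLError,
            # so both HTTP and connection-level failures land here.
            failed.append((i, url, err))
    return failed
```

The returned list can be printed afterwards to see which links were rejected and why, for example `for i, url, err in failed: print(i, url, err)`.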
