
I have code here that finds all the image files on a page with a regex that matches their file extension. What I want to do now is save them to a specified path on my computer while preserving their original filenames. My current code does find the images (I tested by printing `source`), but it doesn't save them to the specified directory. Maybe someone can help me tweak the code.

Thanks in advance.

Here's my code:

import urllib,re,os

_in = raw_input('< Press enter to download images from first page >')
if not os.path.exists('FailImages'): # Directory that I want to save the image to
        os.mkdir('FailImages') # If no directory create it

source = urllib.urlopen('http://www.samplewebpage.com/index.html').read()

imgs = re.findall('\w+.jpg',source) # regex finds files with .jpg extension

# This is the bit that needs tweaking

for img in imgs:
        filename = 'src="'+ img.split('/')[0]
        if not os.path.exists(filename):
                urllib.urlretrieve(img,filename)
  • I suspect you're going to have a much more challenging task on your hands than simply dumping all the image files into a folder. That will work well only if the images aren't named identically. Your best bet would be to capture the relative path to the image (for local images) and recreate the folder structure locally; for external images, you may want to create a similar structure, but contained inside a folder like `www.externalimage.com` (see the sketch after these comments). – brandonscript Dec 05 '13 at 07:37
  • It doesn't matter if the images on the page have the same file names – user3034404 Dec 05 '13 at 12:53
  • Even if some get overwritten? (1.jpg will overwrite 1.jpg)? – brandonscript Dec 05 '13 at 16:04
  • Yes, I just need simple code that will download/save images from a website to my folder. The code doesn't have to be robust. – user3034404 Dec 05 '13 at 22:39
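
A minimal sketch of the folder-recreation idea from the first comment above (a hypothetical helper of my own, not the asker's code; it assumes the image path is given relative to the site root):

import os, urllib2

def save_preserving_path(base_url, rel_path, out_root):
    # Mirror the image's relative directory structure under out_root
    local_path = os.path.join(out_root, rel_path.lstrip('/'))
    local_dir = os.path.dirname(local_path)
    if local_dir and not os.path.exists(local_dir):
        os.makedirs(local_dir)  # creates intermediate directories as needed
    data = urllib2.urlopen(base_url + '/' + rel_path.lstrip('/')).read()
    with open(local_path, 'wb') as f:
        f.write(data)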

1 Answer


This should get you going. It doesn't handle external links, but it will grab local images.

Optional

  1. Install the dependency requests from http://requests.readthedocs.org/en/latest/
  2. From the command line, execute:

     $ sudo easy_install requests

If using requests, uncomment the `#import requests` line and the `#f.write(requests...)` line, and comment out the `f.write(urllib2...)` line (a standalone sketch of that variant follows the code):

import urllib2,re,os
#import requests

folder = "FailImages"

if not os.path.exists(folder): # Directory that I want to save the image to
    os.mkdir(folder) # If no directory create it

url = "http://www.google.ca"
source = urllib2.urlopen(url).read()

imgs = re.findall(r'(https?:/)?(/?[\w_\-&%?./]*?)\.(jpg|png|gif)',source, re.M) # regex captures .jpg, .png, and .gif paths


for img in imgs:
    remote = url + img[1] + "." + img[2]
    filename = folder + "/" + img[1].split('/')[-1] + "." + img[2]
    print "Copying from " + remote + " to " + filename
    if not os.path.exists(filename):
        f = open(filename, 'wb')
        f.write(urllib2.urlopen(remote).read())
        #f.write(requests.get(remote).content)
        f.close()
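
For reference, a minimal sketch of the same loop with those lines swapped to the requests variant (assumes requests is installed and that folder, url, and imgs are set up as above):

import requests

for img in imgs:
    remote = url + img[1] + "." + img[2]
    filename = folder + "/" + img[1].split('/')[-1] + "." + img[2]
    print "Copying from " + remote + " to " + filename
    if not os.path.exists(filename):
        f = open(filename, 'wb')
        f.write(requests.get(remote).content)  # .content is the raw bytes of the response
        f.close()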

Note: requests works a lot better here because it ensures the correct headers are sent; plain urllib may not work much of the time.
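
If you're restricted to the standard library, one possible workaround (my own suggestion, not part of the answer above) is to send an explicit User-Agent header via urllib2.Request, since some servers reject urllib2's default one:

import urllib2

# Hypothetical image URL; substitute the 'remote' variable from the loop above
req = urllib2.Request('http://example.com/image.jpg',
                      headers={'User-Agent': 'Mozilla/5.0'})
data = urllib2.urlopen(req).read()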

  • Thanks for the code, but I must use only Python's standard modules, so no installing packages. – user3034404 Dec 05 '13 at 23:37
  • I got it almost working. What I did was uncomment the f.close and f = open lines but leave the f.write commented out, because it gives me the error 'requests not defined'. It fetched and saved the images with their original filenames, which is what I wanted, BUT the files contain nothing: no bytes, just an empty file in the folder. Any suggestions? Thanks in advance – user3034404 Dec 06 '13 at 01:09
  • That's the problem you're going to get with urlretrieve: it doesn't pass the correct headers. If my power hadn't just gone out, I'd edit the answer to put urlretrieve inside the f.write. If it comes back on, I'll update. Try it yourself if you can, wrapping urlretrieve in f.write() – brandonscript Dec 06 '13 at 01:15
  • I got it sort of working now. It downloads only part of each file; for one of the images, it downloads only 246 bytes instead of 43 KB. What do you think I should do? – user3034404 Dec 06 '13 at 01:30
  • Got it; fixed it up to use urllib2 (why didn't I use that in the first place?) – brandonscript Dec 06 '13 at 01:31
  • It downloads fine from the Google site, but when I try it on the website I'm going to use it on, I get these errors: File "C:\Python27\lib\urllib2.py", line 531, in http_error_default raise HTTPError(req.get_full_url(), code, msg, hdrs, fp) HTTPError: HTTP Error 404: Not Found – user3034404 Dec 06 '13 at 01:42
  • Can I ask one more thing: I also want to display which files didn't download. Can you possibly help me do this? – user3034404 Dec 06 '13 at 02:05
  • Great stuff. Good luck. – brandonscript Dec 06 '13 at 02:14
  • Best to read up on urllib2 and error handling. A good resource: http://stackoverflow.com/questions/666022/what-errors-exceptions-do-i-need-to-handle-with-urllib2-request-urlopen – brandonscript Dec 06 '13 at 04:58
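
For the 404s and the "display which files didn't download" question above, a minimal sketch, assuming the same variables as the answer's loop: catch urllib2's exceptions per image and list the failures at the end.

failed = []
for img in imgs:
    remote = url + img[1] + "." + img[2]
    filename = folder + "/" + img[1].split('/')[-1] + "." + img[2]
    try:
        data = urllib2.urlopen(remote).read()
    except urllib2.URLError as e:  # HTTPError (e.g. 404) is a subclass of URLError
        failed.append(remote + " (" + str(e) + ")")
        continue
    f = open(filename, 'wb')
    f.write(data)
    f.close()

if failed:
    print "Some files didn't download:"
    for miss in failed:
        print "  " + miss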