
I have code here that finds all the image files on a page with a regex that matches their file extension. What I want to do now is save them to a specified path on my computer while preserving their original filenames. My current code does find the images (I tested by printing `source`), but it doesn't save them to the specified directory. Maybe someone can help me tweak the code.

Thanks in advance.

Here's my code:

import urllib,re,os

_in = raw_input('< Press enter to download images from first page >')
if not os.path.exists('FailImages'): # Directory that I want to save the image to
        os.mkdir('FailImages') # If no directory create it

source = urllib.urlopen('http://www.samplewebpage.com/index.html').read()

imgs = re.findall('\w+.jpg',source) # regex finds files with .jpg extension

# This is the bit that needs tweaking

for img in imgs:
        filename = 'src="'+ img.split('/')[0]
        if not os.path.exists(filename):
                urllib.urlretrieve(img,filename)
  • I suspect you're going to have a much more challenging task on your hands than simply dumping all the image files into a folder. That will work well only if the images aren't named identically. Your best bet would be to capture the relative path to the image (for local images) and recreate the folder structure locally; for external images, you may want to create a similar structure, but contained inside a folder like `www.externalimage.com` (see the sketch after these comments). – brandonscript Dec 05 '13 at 07:37
  • It doesn't matter if the images on the page have the same file names – user3034404 Dec 05 '13 at 12:53
  • Even if some get overwritten? (1.jpg will overwrite 1.jpg)? – brandonscript Dec 05 '13 at 16:04
  • Yes, I just need simple code that will download/save images from a website to my folder. The code doesn't have to be robust. – user3034404 Dec 05 '13 at 22:39
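
A minimal sketch of the folder-recreation idea from the first comment above (a hypothetical helper of my own, not the asker's code; it assumes the image path is given relative to the site root):

import os, urllib2

def save_preserving_path(base_url, rel_path, out_root):
    # Mirror the image's relative directory structure under out_root
    local_path = os.path.join(out_root, rel_path.lstrip('/'))
    local_dir = os.path.dirname(local_path)
    if local_dir and not os.path.exists(local_dir):
        os.makedirs(local_dir)  # creates intermediate directories as needed
    data = urllib2.urlopen(base_url + '/' + rel_path.lstrip('/')).read()
    with open(local_path, 'wb') as f:
        f.write(data)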

1 Answer


This should get you going. It doesn't handle external links, but it will grab local images.

Optional

  1. Install the dependency requests from http://requests.readthedocs.org/en/latest/
  2. From the command line, execute:

     $ sudo easy_install requests

If using requests, uncomment the `#import requests` line and the `#f.write(requests...)` line, and comment out the `f.write(urllib2...)` line (a standalone sketch of that variant follows the code):

import urllib2,re,os
#import requests

folder = "FailImages"

if not os.path.exists(folder): # Directory that I want to save the image to
    os.mkdir(folder) # If no directory create it

url = "http://www.google.ca"
source = urllib2.urlopen(url).read()

imgs = re.findall(r'(https?:/)?(/?[\w_\-&%?./]*?)\.(jpg|png|gif)',source, re.M) # regex captures .jpg, .png, and .gif paths


for img in imgs:
    remote = url + img[1] + "." + img[2]
    filename = folder + "/" + img[1].split('/')[-1] + "." + img[2]
    print "Copying from " + remote + " to " + filename
    if not os.path.exists(filename):
        f = open(filename, 'wb')
        f.write(urllib2.urlopen(remote).read())
        #f.write(requests.get(remote).content)
        f.close()
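
For reference, a minimal sketch of the same loop with those lines swapped to the requests variant (assumes requests is installed and that folder, url, and imgs are set up as above):

import requests

for img in imgs:
    remote = url + img[1] + "." + img[2]
    filename = folder + "/" + img[1].split('/')[-1] + "." + img[2]
    print "Copying from " + remote + " to " + filename
    if not os.path.exists(filename):
        f = open(filename, 'wb')
        f.write(requests.get(remote).content)  # .content is the raw bytes of the response
        f.close()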

Note: requests works a lot better here because it ensures the correct headers are sent; plain urllib may not work much of the time.
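
If you're restricted to the standard library, one possible workaround (my own suggestion, not part of the answer above) is to send an explicit User-Agent header via urllib2.Request, since some servers reject urllib2's default one:

import urllib2

# Hypothetical image URL; substitute the 'remote' variable from the loop above
req = urllib2.Request('http://example.com/image.jpg',
                      headers={'User-Agent': 'Mozilla/5.0'})
data = urllib2.urlopen(req).read()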

  • Thanks for the code, but I must use only Python's standard modules, so no installing packages. – user3034404 Dec 05 '13 at 23:37
  • I got it almost working. What I did was uncomment the f.close and f = open lines but leave the f.write commented out, because it gives me the error 'requests not defined'. It fetched and saved the images with their original filenames, which is what I wanted, BUT the files contain nothing: no bytes, just an empty file in the folder. Any suggestions? Thanks in advance – user3034404 Dec 06 '13 at 01:09
  • That's the problem you're going to get with urlretrieve: it doesn't pass the correct headers. If my power hadn't just gone out, I'd edit the answer to put urlretrieve inside the f.write. If it comes back on, I'll update. Try it yourself if you can, wrapping urlretrieve in f.write() – brandonscript Dec 06 '13 at 01:15
  • I got it sort of working now. It downloads only part of each file; for one of the images, it downloads only 246 bytes instead of 43 KB. What do you think I should do? – user3034404 Dec 06 '13 at 01:30
  • Got it; fixed it up to use urllib2 (why didn't I use that in the first place?) – brandonscript Dec 06 '13 at 01:31
  • It downloads fine from the Google site, but when I try it on the website I'm going to use it on, I get these errors: File "C:\Python27\lib\urllib2.py", line 531, in http_error_default raise HTTPError(req.get_full_url(), code, msg, hdrs, fp) HTTPError: HTTP Error 404: Not Found – user3034404 Dec 06 '13 at 01:42
  • Can I ask one more thing: I also want to display which files didn't download. Can you possibly help me do this? – user3034404 Dec 06 '13 at 02:05
  • Great stuff. Good luck. – brandonscript Dec 06 '13 at 02:14
  • Best to read up on urllib2 and error handling. A good resource: http://stackoverflow.com/questions/666022/what-errors-exceptions-do-i-need-to-handle-with-urllib2-request-urlopen – brandonscript Dec 06 '13 at 04:58
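
For the 404s and the "display which files didn't download" question above, a minimal sketch, assuming the same variables as the answer's loop: catch urllib2's exceptions per image and list the failures at the end.

failed = []
for img in imgs:
    remote = url + img[1] + "." + img[2]
    filename = folder + "/" + img[1].split('/')[-1] + "." + img[2]
    try:
        data = urllib2.urlopen(remote).read()
    except urllib2.URLError as e:  # HTTPError (e.g. 404) is a subclass of URLError
        failed.append(remote + " (" + str(e) + ")")
        continue
    f = open(filename, 'wb')
    f.write(data)
    f.close()

if failed:
    print "Some files didn't download:"
    for miss in failed:
        print "  " + miss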