
I use this script to download images from an HTML page. But if the images are large enough, the script doesn't download them properly: every image is 1.15 KB and doesn't display. How can I fix it? What's wrong?

DrStrangeLove

1 Answer


If you download and inspect the HTML in http://tema.ru/travel/new-york.2011.11/, you see things like

<img src="IMG_5072.jpg" alt="" width="1000" height="667" border="1" />

So this page is using relative links.

The line

parsed[2] = image["src"]

changes parsed from

['http', 'tema.ru', '/travel/new-york.2011.11/', '', '', '']

to

['http', 'tema.ru', 'IMG_5072.jpg', '', '', '']

and then forms the new url with

url = urlparse.urlunparse(parsed)

which sets url to http://tema.ru/IMG_5072.jpg, which does not exist. The correct URL is http://tema.ru/travel/new-york.2011.11/IMG_5072.jpg.

We can form that url with

url = urlparse.urljoin(base_url,image['src'])
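To make the difference concrete, here is a small sketch comparing the two approaches on the page and image from this question. The try/except import is an assumption added so the snippet runs on both Python 2 and Python 3; the variable names `broken`, `fixed`, and `absolute` are illustrative only.

```python
try:  # Python 3
    from urllib.parse import urljoin, urlparse, urlunparse
except ImportError:  # Python 2, as used in this answer
    from urlparse import urljoin, urlparse, urlunparse

base_url = "http://tema.ru/travel/new-york.2011.11/"
src = "IMG_5072.jpg"

# Overwriting the path component throws away the page's directory.
parsed = list(urlparse(base_url))
parsed[2] = src
broken = urlunparse(parsed)

# urljoin resolves src relative to the page URL, as a browser would.
fixed = urljoin(base_url, src)

# An src that is already absolute comes back unchanged.
absolute = urljoin(base_url, "http://example.org/test.png")

print(broken)
print(fixed)
print(absolute)
```

Because `urljoin` also copes with absolute `src` values, no separate branch for relative versus absolute links should be needed.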

so try

"""
http://stackoverflow.com/a/258511/190597
Author: Ryan Ginstrom
dumpimages.py
    Downloads all the images on the supplied URL, and saves them to the
    specified output folder ("/tmp" by default)

Usage:
    python dumpimages.py http://example.com/ [output]
"""
import os
import sys
import urllib
import urllib2
import urlparse
import argparse
import BeautifulSoup

def main(base_url, out_folder):
    """Downloads all the images at base_url to out_folder"""
    soup = BeautifulSoup.BeautifulSoup(urllib2.urlopen(base_url))
    for image in soup.findAll("img"):
        src = image['src']
        print "Image: {s}".format(s=src)
        # Name the local file after the last component of the src path
        _, filename = os.path.split(urlparse.urlsplit(src).path)
        outpath = os.path.join(out_folder, filename)
        # Resolve src against the page URL so relative links work
        url = urlparse.urljoin(base_url, src)
        # out_folder must already exist; urlretrieve does not create it
        urllib.urlretrieve(url, outpath)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('url')
    parser.add_argument('out_folder', nargs = '?', default = '/tmp')
    args = parser.parse_args()
    main(args.url, args.out_folder)
unutbu
  • You can leave out the branch in `main` altogether. `urljoin("http://example.org/test.png", "http://google.com/test.png") == "http://google.com/test.png"` – Niklas B. Dec 28 '11 at 20:44
  • i copied and pasted your code. it didnt work! IOError [errno 2] no such file or directory: u' /tmp\\arr.gif' – DrStrangeLove Dec 28 '11 at 22:30
  • Change `default = '/tmp'` to a reasonable value for your Windows machine. Set it to some directory used to save downloads. Or, you can run the script with the download folder supplied as an additional argument: `python dumpimages.py http://tema.ru/travel/new-york.2011.11/ C:\\path\to\download\folder` – unutbu Dec 28 '11 at 22:55
  • @unutbu i tried python dumpimages.py http://tema.ru/travel/new-york.2011.11/ C:\\t It outputs - unrecognised arguments: C:\\t – DrStrangeLove Dec 28 '11 at 23:11
  • Sorry, that should have been `python dumpimages.py http://tema.ru/travel/new-york.2011.11/ --out C:\\t`. – unutbu Dec 28 '11 at 23:51
  • @unutbu it still doesnt work!! it outputs IOError [errno 2] no such file or directory: u'C:\\\\t\\arr.gif' – DrStrangeLove Dec 29 '11 at 00:20
  • Okay, one more try: `python dumpimages.py http://tema.ru/travel/new-york.2011.11/ --out C:\t`. – unutbu Dec 29 '11 at 01:57
  • @unutbu Still,no luck :(( now it's IOError [errno 2] no such file or directory: u'C:\\t\\arr.gif' – DrStrangeLove Dec 29 '11 at 02:10
  • Okay, I clearly have no clue how to operate in a Windows environment. What is the path to the desired directory? Do you have a directory named `t` at the top level of the `C:` drive? – unutbu Dec 29 '11 at 02:19
  • @unutbu Thanks!! i manually created t folder as C:\t and it worked!!:)) I thought it created t folder if that folder didn't exist :( – DrStrangeLove Dec 29 '11 at 02:27
  • Oh, wonderful. Yay! Now that the main problem is solved, I've changed the argparser so `python dumpimages.py http://tema.ru/travel/new-york.2011.11/ C:\t` should work (without the `--out`). – unutbu Dec 29 '11 at 02:33
  • I am getting invalid syntax for : `print "Image {s}".format(s=src)` at the second quote. Any ideas? – tehaaron Dec 31 '11 at 06:00
  • oops forgot I upgraded to python 3 – tehaaron Dec 31 '11 at 06:13
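Two issues from the comments, the missing output folder and the Python 3 syntax error, could be handled along the following lines. This is only a sketch, not the answer's script: it targets Python 3, swaps BeautifulSoup for the standard library's `html.parser` so it has no third-party dependency, and the names `ImgSrcParser`, `image_urls`, and `download_images` are hypothetical. Decoding the page as UTF-8 with replacement characters is also an assumption; a page in another encoding would still yield usable ASCII `src` values.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen, urlretrieve


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a usable src
                self.sources.append(src)


def image_urls(html, base_url):
    """Returns absolute URLs for all images in html, resolved against base_url."""
    parser = ImgSrcParser()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]


def download_images(base_url, out_folder):
    """Downloads all images on the page at base_url into out_folder."""
    # Create the folder up front; urlretrieve will not create it.
    os.makedirs(out_folder, exist_ok=True)
    # Assumption: UTF-8 with replacement is good enough to find the tags.
    html = urlopen(base_url).read().decode("utf-8", errors="replace")
    for url in image_urls(html, base_url):
        filename = os.path.basename(urlsplit(url).path)
        if not filename:  # URL ends in "/", nothing to name the file after
            continue
        print("Image: {s}".format(s=url))
        urlretrieve(url, os.path.join(out_folder, filename))
```

With this sketch, something like `download_images("http://tema.ru/travel/new-york.2011.11/", r"C:\t")` should create `C:\t` if it is missing before saving the images there.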