I use this script to download images from the same HTML page. But if the images are large enough, the script doesn't download them properly: every image is 1.15 KB and doesn't display. How can I fix it? What's wrong?
-
Can you post an example page where the problem occurs? – Niklas B. Dec 28 '11 at 19:38
-
How about an example URL that the script fails on? – Matt Ball Dec 28 '11 at 19:39
-
http://tema.ru/travel/new-york.2011.11/ – DrStrangeLove Dec 28 '11 at 19:40
1 Answer
If you download and inspect the HTML in http://tema.ru/travel/new-york.2011.11/, you see things like
<img src="IMG_5072.jpg" alt="" width="1000" height="667" border="1" />
So this page is using relative links.
The line
parsed[2] = image["src"]
changes parsed
from
['http', 'tema.ru', '/travel/new-york.2011.11/', '', '', '']
to
['http', 'tema.ru', 'IMG_5072.jpg', '', '', '']
and then forms the new url with
url = urlparse.urlunparse(parsed)
which sets url to http://tema.ru/IMG_5072.jpg, which does not exist.
The correct url is http://tema.ru/travel/new-york.2011.11/IMG_5072.jpg.
We can form that url with
url = urlparse.urljoin(base_url,image['src'])
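The difference between the two approaches can be checked in isolation. The snippet below is a minimal sketch, assuming Python 3, where the `urlparse` module used above lives in `urllib.parse`; the variable names `broken`, `fixed`, and `absolute` are illustrative only.

```python
from urllib.parse import urljoin, urlparse, urlunparse

base_url = "http://tema.ru/travel/new-york.2011.11/"
src = "IMG_5072.jpg"  # relative link taken from the page's <img> tag

# Overwriting the path component discards the directory part of the base URL.
parsed = list(urlparse(base_url))
parsed[2] = src
broken = urlunparse(parsed)

# urljoin resolves the relative link against the base URL instead.
fixed = urljoin(base_url, src)

# An absolute src is returned unchanged, so no special case is needed.
absolute = urljoin("http://example.org/test.png", "http://google.com/test.png")

print(broken)
print(fixed)
print(absolute)
```

Because `urljoin` handles both relative and absolute `src` values, the same call can be used for every image on the page.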
so try
"""
http://stackoverflow.com/a/258511/190597
Author: Ryan Ginstrom
dumpimages.py
Downloads all the images on the supplied URL, and saves them to the
specified output file ("/tmp" by default)
Usage:
python dumpimages.py http://example.com/ [output]
"""
import os
import sys
import urllib
import urllib2
import urlparse
import argparse
import BeautifulSoup
def main(base_url, out_folder):
    """Downloads all the images at 'url' to out_folder"""
    soup = BeautifulSoup.BeautifulSoup(urllib2.urlopen(base_url))
    for image in soup.findAll("img"):
        src = image['src']
        print "Image: {s}".format(s=src)
        _, filename = os.path.split(urlparse.urlsplit(src).path)
        outpath = os.path.join(out_folder, filename)
        url = urlparse.urljoin(base_url, src)
        urllib.urlretrieve(url, outpath)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('url')
    parser.add_argument('out_folder', nargs = '?', default = '/tmp')
    args = parser.parse_args()
    main(args.url, args.out_folder)

unutbu
-
You can leave out the branch in `main` altogether. `urljoin("http://example.org/test.png", "http://google.com/test.png") == "http://google.com/test.png"` – Niklas B. Dec 28 '11 at 20:44
-
I copied and pasted your code. It didn't work! IOError [errno 2] no such file or directory: u' /tmp\\arr.gif' – DrStrangeLove Dec 28 '11 at 22:30
-
Change `default = '/tmp'` to a reasonable value for your Windows machine. Set it to some directory used to save downloads. Or, you can run the script with the download folder supplied as an additional argument: `python dumpimages.py http://tema.ru/travel/new-york.2011.11/ C:\\path\to\download\folder` – unutbu Dec 28 '11 at 22:55
-
@unutbu I tried `python dumpimages.py http://tema.ru/travel/new-york.2011.11/ C:\\t`. It outputs: unrecognised arguments: C:\\t – DrStrangeLove Dec 28 '11 at 23:11
-
Sorry, that should have been `python dumpimages.py http://tema.ru/travel/new-york.2011.11/ --out C:\\t`. – unutbu Dec 28 '11 at 23:51
-
@unutbu It still doesn't work!! It outputs IOError [errno 2] no such file or directory: u'C:\\\\t\\arr.gif' – DrStrangeLove Dec 29 '11 at 00:20
-
Okay, one more try: `python dumpimages.py http://tema.ru/travel/new-york.2011.11/ --out C:\t`. – unutbu Dec 29 '11 at 01:57
-
@unutbu Still, no luck :(( Now it's IOError [errno 2] no such file or directory: u'C:\\t\\arr.gif' – DrStrangeLove Dec 29 '11 at 02:10
-
Okay, I clearly have no clue how to operate in a Windows environment. What is the path to the desired directory? Do you have a directory named `t` at the top level of the `C:` drive? – unutbu Dec 29 '11 at 02:19
-
@unutbu Thanks!! I manually created the `t` folder as C:\t and it worked!! :)) I thought it would create the `t` folder if that folder didn't exist :( – DrStrangeLove Dec 29 '11 at 02:27
-
Oh, wonderful. Yay! Now that the main problem is solved, I've changed the argparser so `python dumpimages.py http://tema.ru/travel/new-york.2011.11/ C:\t` should work (without the `--out`). – unutbu Dec 29 '11 at 02:33
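The IOError in the comments above came from the output folder not existing: the script writes into `out_folder` but never creates it. One way to avoid creating the folder by hand is sketched below, assuming Python 3; the helper name `ensure_folder` is hypothetical and not part of the original script, and the demonstration uses a temporary directory rather than `C:\t`.

```python
import os
import tempfile


def ensure_folder(path):
    """Create path (and any missing parent folders) if it does not exist yet."""
    # exist_ok=True makes the call a no-op when the folder is already there.
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrate inside a throwaway temporary directory.
base = tempfile.mkdtemp()
out_folder = ensure_folder(os.path.join(base, "t"))

# Calling it a second time is harmless.
ensure_folder(out_folder)
print(os.path.isdir(out_folder))
```

In the Python 2 script from the answer, `os.makedirs` has no `exist_ok` argument, so the equivalent would be to check `os.path.isdir(out_folder)` first and call `os.makedirs(out_folder)` only when it returns false.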
-
I am getting invalid syntax for : `print "Image {s}".format(s=src)` at the second quote. Any ideas? – tehaaron Dec 31 '11 at 06:00
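A likely cause, assuming the script was run with a Python 3 interpreter (the thread does not say which version was used): the answer's script is Python 2 code, and in Python 3 `print` is a function, so `print "..."` without parentheses is a syntax error. A minimal sketch of the Python 3 form of that line:

```python
src = "IMG_5072.jpg"  # example value; in the script this comes from image['src']

# Python 3: print is a function and needs parentheses.
line = "Image: {s}".format(s=src)
print(line)
```

Under the same assumption, the rest of the script would also need porting: `urllib2` and `urlparse` became `urllib.request` and `urllib.parse` in Python 3, and `import BeautifulSoup` refers to the Python 2 only BeautifulSoup 3 package.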