Scraping a page for images but files are returned as empty

Question

I'm modifying this script to scrape pages like this for the book page images. Using the script directly from Stack Overflow, it returns all the images correctly except the one image I want. That image is saved as an empty file with a name like this: img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png.

In my modified version below I'm only pulling the book page image.

Here's my script:

from bs4 import BeautifulSoup as bs
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os
import sys

out_folder = '/Users/Craig/Desktop/img'

def main(url, out_folder):
    soup = bs(urlopen(url))
    parsed = list(urlparse.urlparse(url))

    for image in soup.findAll('img', id='page_image'):
        print "Image: %(src)s" % image
        filename = image["src"].split("/")[-1]
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
        else:
            urlretrieve(urlparse.urlunparse(parsed), outpath)

def _usage():
    print "usage: python dumpimages.py http://example.com [outpath]"

if __name__ == "__main__":
    url = sys.argv[-1]
    if not url.lower().startswith("http"):
        out_folder = sys.argv[-1]
        url = sys.argv[-2]
        if not url.lower().startswith("http"):
            _usage()
            sys.exit(-1)
    main(url, out_folder)

Any ideas?

score 3 · Accepted Answer · answered Jul 27 '13 at 19:07

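The issue is the URL your script builds for the image. You replace only parsed[2] (the path) with image["src"], so the page's original query string (file=1077091&pg=1) is still in parsed and urlunparse appends it. The URL you end up requesting is: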

http://bookre.org/loader/img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png?file=1077091&pg=1

When you actually want it to be:

http://bookre.org/loader/img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png
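
To make the difference concrete, here is a minimal sketch that reproduces both URLs without touching the network. The value of IMG_SRC is an assumption (the question does not show the img tag's src attribute; it is inferred from the URLs above), and the two function names are made up for illustration. The import shim only lets the same code run on Python 2 and Python 3.

```python
try:
    from urllib.parse import urlparse, urlunparse, urljoin  # Python 3
except ImportError:
    from urlparse import urlparse, urlunparse, urljoin  # Python 2

PAGE_URL = "http://bookre.org/reader?file=1077091&pg=1"
# Assumed value of the <img id="page_image"> src attribute (not shown in the question).
IMG_SRC = "/loader/img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png"


def url_as_built_in_question(page_url, src):
    """Mimic the question's logic: overwrite only the path component."""
    parsed = list(urlparse(page_url))
    parsed[2] = src  # the page's own query string is still in parsed[4]
    return urlunparse(parsed)


def url_resolved_with_urljoin(page_url, src):
    """Resolve src against the page URL; the page's query string is dropped."""
    return urljoin(page_url, src)


print(url_as_built_in_question(PAGE_URL, IMG_SRC))
print(url_resolved_with_urljoin(PAGE_URL, IMG_SRC))
```

urljoin is a standard-library alternative to assembling the tuple by hand. It also handles src values that are already absolute, so the startswith("http") check would no longer be needed.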

Here's something I hacked together in 2 minutes to download the image you required from the website you listed:

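import urllib
import urllib2
import urlparse
from bs4 import BeautifulSoup

def main(url):
    # Fetch and parse the reader page.
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html.read(), "html.parser")

    # Keep the page's scheme and host for resolving relative image paths.
    parsed = list(urlparse.urlparse(url))

    for image in soup.find_all(id="page_image"):
        if image["src"].lower().startswith("http"):
            # Absolute URL: download it as-is.
            urllib.urlretrieve(image["src"], "image.png")
        else:
            # Relative URL: reuse the scheme and host, and leave the query
            # slot empty so the page's own query string is not appended.
            new = (parsed[0], parsed[1], image["src"], "", "", "")
            urllib.urlretrieve(urlparse.urlunparse(new), "image.png")


if __name__ == '__main__':
    main("http://bookre.org/reader?file=1077091&pg=1")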

The script saves the image as "image.png" in the current working directory (the directory you run the script from). Hope this is what you were after; let us know if you run into any difficulties.
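
If a fixed name like "image.png" is too limiting, one option is to take the name from the file parameter of the image URL instead of splitting on "/", which is what produced the odd img.php?dir=... filename in the question. This is only a sketch: filename_from_src is a hypothetical helper, and it assumes the src carries a file=... query parameter as in the URL above.

```python
import os

try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2


def filename_from_src(src, default="image.png"):
    """Return the 'file' query parameter of src, or a default name."""
    query = parse_qs(urlparse(src).query)
    # parse_qs maps each key to a list of values; take the first one.
    name = query.get("file", [default])[0]
    # Drop any directory part so the name cannot escape the output folder.
    return os.path.basename(name) or default


print(filename_from_src("/loader/img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png"))
```

The result can then be passed to os.path.join(out_folder, ...) as in the original script.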

score 0 · Answer 2 · answered Jul 27 '13 at 18:35

In your:

else:
    urlretrieve(urlparse.urlunparse(parsed), outpath)

you need to replace some of the elements in parsed with those from image["src"].

Steve Barnes

Could you be a little bit more specific? – Craig Cannon Jul 27 '13 at 18:38
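
One possible reading of that suggestion, as a sketch rather than the answerer's own code: parse image["src"] as well, take its path and query, and fall back to the page's scheme and host when src does not supply them. The helper name rebuild_image_url is hypothetical, and the short dir value in the usage line is a placeholder.

```python
try:
    from urllib.parse import urlparse, urlunparse  # Python 3
except ImportError:
    from urlparse import urlparse, urlunparse  # Python 2


def rebuild_image_url(page_url, src):
    """Combine the page's scheme/host with the path and query from src."""
    page = urlparse(page_url)
    img = urlparse(src)
    return urlunparse((
        img.scheme or page.scheme,  # keep src's scheme if it has one
        img.netloc or page.netloc,  # keep src's host if it has one
        img.path,                   # path always comes from src
        img.params,
        img.query,                  # query comes from src, not from the page
        img.fragment,
    ))


print(rebuild_image_url("http://bookre.org/reader?file=1077091&pg=1",
                        "/loader/img.php?dir=abc&file=1.png"))
```

This handles src values that start with "/" or are already absolute. A src without a leading slash would still need to be resolved against the page's directory, which urlparse's urljoin does for you.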

score 0 · Answer 3 · answered Jul 27 '13 at 18:46

So much easier with pyquery:

from pyquery import PyQuery as pq
image, = [img.attrib['src'] for img in pq(url=url)('img#page_image')]
...

(Note the funky use of name, = ['string'] to unpack the one-element list. It raises ValueError if the list does not contain exactly one item.)
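
For anyone unfamiliar with the trailing-comma unpacking, here is a small standalone sketch of how it behaves. The helper name only is made up for illustration and is not part of pyquery.

```python
def only(items):
    """Return the single element of items; raise ValueError otherwise."""
    value, = items  # equivalent to (value,) = items
    return value


print(only(["/loader/img.php"]))

try:
    only([])  # no matching <img> on the page
except ValueError as exc:
    print("unpacking failed: %s" % exc)
```

This is a compact way to assert that the selector matched exactly one element. If the page might contain zero or several matches, check the length of the list before unpacking.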

swstephe
