I'm modifying this script to scrape pages like this for the book page images. The script taken directly from Stack Overflow downloads all the images correctly except the one image I want. That image is saved as an empty file with a name like this: img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png.

In my modified version below I'm only pulling the book page image.

Here's my script:

from bs4 import BeautifulSoup as bs
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os
import sys

out_folder = '/Users/Craig/Desktop/img'

def main(url, out_folder):
    soup = bs(urlopen(url))
    parsed = list(urlparse.urlparse(url))

    for image in soup.findAll('img', id='page_image'):
        print "Image: %(src)s" % image
        filename = image["src"].split("/")[-1]
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
        else:
            urlretrieve(urlparse.urlunparse(parsed), outpath)

def _usage():
    print "usage: python dumpimages.py http://example.com [outpath]"

if __name__ == "__main__":
    url = sys.argv[-1]
    if not url.lower().startswith("http"):
        out_folder = sys.argv[-1]
        url = sys.argv[-2]
        if not url.lower().startswith("http"):
            _usage()
            sys.exit(-1)
    main(url, out_folder)

Any ideas?

Craig Cannon

3 Answers

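The issue here is that the URL your script builds to retrieve the image still carries the query string of the page URL (?file=1077091&pg=1), because only parsed[2] (the path) is replaced: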

http://bookre.org/loader/img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png?file=1077091&pg=1

When you actually want it to be:

http://bookre.org/loader/img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png
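To make the difference concrete, here is a minimal sketch that reproduces both URLs. The src value is an assumption inferred from the file name in the question, and the import fallback is only there so the snippet runs on Python 2 (as in the question) or Python 3.

```python
try:
    from urllib.parse import urljoin, urlparse, urlunparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse, urlunparse  # Python 2

page_url = "http://bookre.org/reader?file=1077091&pg=1"
# Assumed value of the src attribute of <img id="page_image"> on that page.
src = "/loader/img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png"

# What the question's script does: replace the path, keep the page's query.
parsed = list(urlparse(page_url))
parsed[2] = src
broken_url = urlunparse(parsed)

# urljoin resolves src against the page URL and drops the page's own query.
fixed_url = urljoin(page_url, src)

print(broken_url)
print(fixed_url)
```

urljoin also returns src unchanged when it is already an absolute URL, so the startswith("http") check would probably become unnecessary.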

Here's something I hacked together in 2 minutes to download the image you required from the website you listed:

import urllib
import urllib2
import urlparse
from bs4 import BeautifulSoup

def main(url):
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html.read())

    parsed = list(urlparse.urlparse(url))

    for image in soup.find_all(id="page_image"):
        if image["src"].lower().startswith("http"):
            urllib.urlretrieve(image["src"], "image.png")
        else:
            new = (parsed[0], parsed[1], image["src"], "", "", "")
            urllib.urlretrieve(urlparse.urlunparse(new), "image.png")


if __name__ == '__main__':
    main("http://bookre.org/reader?file=1077091&pg=1")

The script saves the image as "image.png" in the current working directory. Hope this is what you were after; let us know if you run into any difficulties.
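If you would rather not hardcode the output name, one possible approach is to derive it from the image URL. The sketch below is only an illustration: filename_from_src is a hypothetical helper, and it assumes the real file name is carried in a file query parameter, as in the img.php URL above.

```python
import os

try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2


def filename_from_src(src, default="image.png"):
    """Hypothetical helper: pick a local file name for an image URL."""
    parts = urlparse(src)
    # Prefer an explicit ?file=... parameter (img.php?dir=...&file=1.png).
    names = parse_qs(parts.query).get("file")
    if names:
        return os.path.basename(names[0])
    # Otherwise use the last path segment, then fall back to the default.
    return os.path.basename(parts.path) or default
```

This avoids names such as img.php?dir=...&file=1.png, which is what splitting src on "/" produces.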

Hayden

In your:

else:
    urlretrieve(urlparse.urlunparse(parsed), outpath)

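You need to replace more of the elements in parsed with those from image["src"]. Assigning image["src"] to parsed[2] changes only the path, so the page's original query string in parsed[4] is still appended to the result.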
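A rough sketch of that idea, with resolve_src as a hypothetical helper name: parse image["src"] as well, take the path and query from it, and fall back to the page's scheme and host only when src does not supply them.

```python
try:
    from urllib.parse import urlparse, urlunparse  # Python 3
except ImportError:
    from urlparse import urlparse, urlunparse  # Python 2


def resolve_src(page_url, src):
    """Hypothetical helper: build the image URL component by component."""
    page = urlparse(page_url)
    img = urlparse(src)
    return urlunparse((
        img.scheme or page.scheme,  # inherit the scheme if src has none
        img.netloc or page.netloc,  # inherit the host if src has none
        img.path,                   # path comes from src
        img.params,
        img.query,                  # query comes from src, not from the page
        img.fragment,
    ))
```

This only covers absolute URLs and root-relative paths such as /loader/img.php. A src relative to the page's directory (for example img/1.png) needs proper relative resolution, which the standard library's urljoin provides.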

Steve Barnes

So much easier with pyquery:

from pyquery import PyQuery as pq
image, = [img.attrib['src'] for img in pq(url=url)('img#page_image')]
...

(Note the funky use of name, = ['string'] to unpack the one-element list. It raises ValueError if the selector does not match exactly one image.)
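For anyone unfamiliar with the idiom, here is a small sketch in plain Python, without pyquery, of what single-element unpacking does. The helper name only is hypothetical, and the explicit length check is just there to give a clearer error message.

```python
def only(items):
    """Hypothetical helper: return the single element of items."""
    items = list(items)
    if len(items) != 1:
        raise ValueError("expected exactly one match, found %d" % len(items))
    (item,) = items  # same unpacking as `item, = items`
    return item


print(only(["/loader/img.php?file=1.png"]))
```

Passing an empty list or more than one match raises ValueError, which is also what the bare `image, = [...]` form does.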

swstephe