Batch downloading text and images from URL with Python / urllib / beautifulsoup?

Question

I have been browsing through several posts here but I just cannot get my head around batch-downloading images and text from a given URL with Python.

import urllib,urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
import os, sys

def getAllImages(url):
    query = urllib2.Request(url)
    user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)"
    query.add_header("User-Agent", user_agent)

    page = BeautifulSoup(urllib2.urlopen(query))
    for div in page.findAll("div", {"class": "thumbnail"}):
        print "found thumbnail"
        for img in div.findAll("img"):
            print "found image"
            src = img["src"]
            if src:
                src = absolutize(src, pageurl)
                f = open(src,'wb')
                f.write(urllib.urlopen(src).read())
                f.close()
        for h5 in div.findAll("h5"):
            print "found Headline"
            value = (h5.contents[0])
            print >> headlines.txt, value


def main():
    getAllImages("http://www.nytimes.com/")

Above is my updated code. When I run it, nothing happens: it never finds a div with the class "thumbnail", so none of the print statements produce any output. I am probably missing something about how to reach the right divs containing the images and headlines?

Thanks a lot!

You might get more detailed answers if you could explain what exact issues you are running into when you try to download the file. Have you read posts like http://stackoverflow.com/questions/3042757/downloading-a-picture-via-urllib-and-python, which contain code for downloading images in their answers? — Martey, Oct 27 '11 at 15:31

Sean Vieira · Accepted Answer · 2014-06-19T14:14:59.157

The OS you are using doesn't know how to write to the file path you are passing it in src. Make sure that the name you use to save the file to disk is one the OS can actually use:

import os

src = "abc.com/alpha/beta/charlie.jpg"
with open(src, "wb") as f:
    # IOError - cannot open file abc.com/alpha/beta/charlie.jpg
    pass

src = "alpha/beta/charlie.jpg"
os.makedirs(os.path.dirname(src))
with open(src, "wb") as f:
    # Golden - write file here
    pass

and everything will start working.
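
As a minimal sketch of that idea, the helpers below turn an image URL into a relative path the OS can open and create the missing directories before writing. The names relative_path_from_url and save_bytes are hypothetical, not part of any library, and the code targets Python 3 with a fallback import for the Python 2 modules used in the question.

```python
import os

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2, as used in the question


def relative_path_from_url(url):
    """Turn an image URL into a relative file path (hypothetical helper)."""
    parsed = urlparse(url)
    # Keep only real path segments; drop "", "." and ".." entries.
    parts = [p for p in parsed.path.split("/") if p not in ("", ".", "..")]
    if not parts:
        raise ValueError("URL has no file name: %r" % url)
    # The scheme and host are discarded, so "http://" never reaches open().
    return os.path.join(*parts)


def save_bytes(data, path):
    """Create any missing parent directories, then write the bytes to path."""
    directory = os.path.dirname(path)
    if directory and not os.path.isdir(directory):
        os.makedirs(directory)
    with open(path, "wb") as f:
        f.write(data)
```

In the question's loop, this would presumably replace the open(src, 'wb') call with something like save_bytes(urllib.urlopen(src).read(), relative_path_from_url(src)).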

A couple of additional thoughts:

Make sure to normalize the save file path (e.g. os.path.join(some_root_dir, relative_file_path)) - otherwise you'll be writing images all over your hard drive depending on their src.
Unless you are running tests of some kind, it's good to advertise that you are a bot in your user_agent string and honor robots.txt files (or alternatively, provide some kind of contact information so people can ask you to stop if they need to).
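
Both suggestions can be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the bot name, the contact address and the helper names safe_join, allowed_by_robots and build_request are all hypothetical. Only Request and RobotFileParser come from the standard library (Python 3 imports, with a fallback to the Python 2 module names).

```python
import os

try:
    from urllib.request import Request  # Python 3
    from urllib.robotparser import RobotFileParser
except ImportError:
    from urllib2 import Request  # Python 2, as used in the question
    from robotparser import RobotFileParser

# Hypothetical bot name and contact address; replace with your own.
USER_AGENT = "example-image-bot/0.1 (contact: you@example.com)"


def safe_join(root_dir, relative_path):
    """Join relative_path onto root_dir, refusing paths that leave root_dir."""
    root = os.path.abspath(root_dir)
    target = os.path.abspath(os.path.join(root, relative_path))
    if not target.startswith(root + os.sep):
        raise ValueError("%r escapes %r" % (relative_path, root_dir))
    return target


def allowed_by_robots(robots_txt, url):
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


def build_request(url):
    """Build a request that identifies the script as a bot."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

Passing the robots.txt text in as a string keeps the check free of network access here; RobotFileParser can also fetch the file itself through its set_url() and read() methods.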

edited Jun 19 '14 at 14:14

answered Oct 27 '11 at 16:54

Sean Vieira


Thanks a lot for the quick reply, unfortunately, after having changed that one line, still I get no result in any way. Running the code just results in nothing.... :( – birgit Oct 27 '11 at 17:19
Traceback (most recent call last):
  File "test.py", line 40, in <module>
    main()
  File "test.py", line 35, in main
    call = getAllImages("http://www.nytimes.com/")
  File "test.py", line 21, in getAllImages
    f = open(src,'wb')
IOError: [Errno 2] No such file or directory: u'http://i1.nyt.com/images/2011/10/27/us/cain1/cain1-thumbStandard.jpg'
..... is this the point where the normalizing of the part comes into play!? – birgit Oct 27 '11 at 17:32
@user1016690 - yes, that is what I was talking about. You're trying to open a file on your hard drive at `http://i1.nyt.com/images/2011/10/27/us/cain1/cain1-thumbStandard.jpg` ... and the OS legitimately complains that there is not a writable device called `http://`. :-) – Sean Vieira Oct 27 '11 at 17:38
BeautifulSoup4 handles file-like objects just fine. Just sayin'. – Martijn Pieters Jun 19 '14 at 10:22
@Martijn - *chuckles* yes, indeed it does. Updated to address the actual cause of the problem. Thanks for helping make the answer better! – Sean Vieira Jun 19 '14 at 14:15
