Try putting `print html` after you do your `page.read()`. You may not be getting what you think you are; it sounds like you're receiving an error page rather than the file itself. I'm not sure you're even handling the file correctly; you might find a better approach here: Download all the links (related documents) on a webpage using Python.
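As a quick sanity check, something like this minimal sketch (assuming a `mechanize.Browser` set up the way your snippet suggests) will make it obvious whether you're getting the zip or an HTML error page:

    import mechanize

    mech = mechanize.Browser()
    mech.set_handle_robots(False)
    url = "http://storage.googleapis.com/patents/retro/2011/ad20111231-02.zip"

    page = mech.open(url)
    html = page.read()

    # Binary zip data starts with 'PK'; readable HTML here means you got an error page
    print html[:200]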
The zip file isn't 4 KB, btw; it's ~87 MB and contains a 784 MB XML file, which you should be able to confirm by hitting that URL in a browser and downloading it. It may not be an infinite loop that's the problem; it may simply be taking a long time to download.
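If you want to confirm the size from Python, the response headers report it up front (a sketch, using the mechanize response object from the snippet above):

    # The server reports the compressed size; expect roughly 87 MB, not 4 KB
    print page.info().getheader('Content-Length')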
You're also trying to pass the data in as HTML when it's zip-archived XML. If (once you actually have the file) you store the response data in a `StringIO`, you'll be able to unzip it in memory (as outlined here). You will then need to explicitly tell `BeautifulSoup` that you're passing it XML:
    soup = BeautifulSoup(html, 'xml')
This will require you to install lxml, but that will work out to your advantage, as it's possibly the fastest XML parser available for Python.
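Putting those pieces together, a rough sketch (assuming the response really is the zip and that the archive contains a single XML file; `buf`, `archive`, and `xml_data` are just illustrative names):

    from StringIO import StringIO
    import zipfile

    from bs4 import BeautifulSoup

    # html here is the raw bytes you got from page.read();
    # wrapping them in StringIO lets zipfile treat them like a file on disk
    buf = StringIO(html)
    archive = zipfile.ZipFile(buf)

    # Pull the (single, very large) XML file out of the archive
    xml_name = archive.namelist()[0]
    xml_data = archive.read(xml_name)

    # Parse as XML, not HTML -- this is where lxml gets used
    soup = BeautifulSoup(xml_data, 'xml')

Bear in mind the decompressed XML is around 784 MB, so parsing it entirely in memory like this will be slow and memory-hungry; `archive.open(xml_name)` gives you a file-like object if you'd rather stream it into something like `lxml.etree.iterparse` instead.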
One last thing:
    mech.set_handle_robots(False)
    url = "http://storage.googleapis.com/patents/retro/2011/ad20111231-02.zip"
I was under the impression Google set up their `robots.txt` to disallow scraping as much as possible. If you're still unable to even download a copy of the file, I'd recommend trying Selenium; it's a lot like `mechanize`, but it controls actual browsers, like Chrome and Firefox, so it will be a legitimate browser request.