Try putting `print html` after you do your `page.read()`. You may not be getting what you think you are; it sounds like you're receiving an error page rather than the file itself. I'm not sure you're even handling the file correctly; you might find a better approach here: Download all the links (related documents) on a webpage using Python.
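As a quick sanity check, something like this minimal sketch (assuming a `mechanize.Browser` set up the way your snippet suggests) will make it obvious whether you're getting the zip or an HTML error page:

    import mechanize

    mech = mechanize.Browser()
    mech.set_handle_robots(False)
    url = "http://storage.googleapis.com/patents/retro/2011/ad20111231-02.zip"

    page = mech.open(url)
    html = page.read()

    # Binary zip data starts with 'PK'; readable HTML here means you got an error page
    print html[:200]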
The zip file isn't 4 KB, btw; it's ~87 MB and contains a 784 MB XML file, which you should be able to confirm by hitting that URL in a browser and downloading it. It may not be an infinite loop that's the problem; it may simply be taking a long time to download.
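If you want to confirm the size from Python, the response headers report it up front (a sketch, using the mechanize response object from the snippet above):

    # The server reports the compressed size; expect roughly 87 MB, not 4 KB
    print page.info().getheader('Content-Length')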
You're also trying to pass the data in as HTML when it's zip-archived XML. If (once you actually have the file) you store the response data in a `StringIO`, you'll be able to unzip it in memory (as outlined here). You will then need to explicitly tell `BeautifulSoup` that you're passing it XML:
    soup = BeautifulSoup(html, 'xml')
This will require you to install lxml, but that will work out to your advantage, as it's possibly the fastest XML parser available for Python.
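Putting those pieces together, a rough sketch (assuming the response really is the zip and that the archive contains a single XML file; `buf`, `archive`, and `xml_data` are just illustrative names):

    from StringIO import StringIO
    import zipfile

    from bs4 import BeautifulSoup

    # html here is the raw bytes you got from page.read();
    # wrapping them in StringIO lets zipfile treat them like a file on disk
    buf = StringIO(html)
    archive = zipfile.ZipFile(buf)

    # Pull the (single, very large) XML file out of the archive
    xml_name = archive.namelist()[0]
    xml_data = archive.read(xml_name)

    # Parse as XML, not HTML -- this is where lxml gets used
    soup = BeautifulSoup(xml_data, 'xml')

Bear in mind the decompressed XML is around 784 MB, so parsing it entirely in memory like this will be slow and memory-hungry; `archive.open(xml_name)` gives you a file-like object if you'd rather stream it into something like `lxml.etree.iterparse` instead.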
One last thing:
    mech.set_handle_robots(False)
    url = "http://storage.googleapis.com/patents/retro/2011/ad20111231-02.zip"
I was under the impression Google set up their `robots.txt` to disallow scraping as much as possible. If you're still unable to even download a copy of the file, I'd recommend trying Selenium; it's a lot like `mechanize`, but it controls actual browsers, like Chrome and Firefox, so it will be a legitimate browser request.