
After getting

mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

when using Mechanize, I added the code from Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt" to ignore robots.txt, but now I am receiving this error instead:

mechanize._response.httperror_seek_wrapper: HTTP Error 403: Forbidden

Is there a way around this error?

(Current code)

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # skip mechanize's client-side robots.txt check

2 Answers


Adding this line of code underneath the two lines posted in the question solved the issue I was having (a complete sketch follows below):

br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
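
For reference, here is a minimal end-to-end sketch combining the two lines from the question with the User-agent header above. The URL is a placeholder assumption; substitute the page you are actually scraping.

import mechanize

br = mechanize.Browser()
# Skip mechanize's client-side robots.txt check
br.set_handle_robots(False)
# Present a browser-like User-agent so the server does not reject the request outright
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# 'http://example.com' is a placeholder URL
response = br.open('http://example.com')
print(response.read()[:200])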

HTTP Error 403: request disallowed by robots.txt

is confusing and semantically wrong here. This is not an HTTP 403 error returned by the server; it is a hard-coded 403 fabricated by the client after parsing the site's robots.txt file.

This library goes against the most basic principle of the HTTP specification: HTTP errors are generated by HTTP servers for the benefit of clients, and they should not be fabricated by the client itself. It is like driving the wrong way down the motorway and blaming the other drivers.
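
To illustrate the point, here is a rough sketch of what such a client-side robots.txt check looks like, using the standard library's robotparser rather than mechanize's internals. This only mirrors the behaviour described above; it is not mechanize's actual code, and the URLs are placeholders.

import urllib.robotparser

url = 'http://example.com/some/page'  # placeholder URL

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()

if not rp.can_fetch('Python-urllib', url):
    # No request for the page was ever sent; the "403" is invented locally
    raise Exception('HTTP Error 403: request disallowed by robots.txt')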
