
So, I created a Django website to web-scrape news pages for articles. Even though I use mechanize, the sites still tell me:

HTTP Error 403: request disallowed by robots.txt 

I tried everything. Look at my code (just the scraping part):

br = mechanize.Browser()
page = br.open(web)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    #BeautifulSoup 
htmlcontent = page.read()
soup = BeautifulSoup(htmlcontent)

I also tried calling br.open() before set_handle_robots(False), and so on. It didn't work either.

Is there any way to get through to these sites?

Julian Slonim
  • They are disallowed because those sites don't want any bot to access their resources. There might be legal terms. You should stay away from them. – Bibhas Debnath Sep 16 '13 at 06:33

1 Answer


You're calling br.set_handle_robots(False) after br.open(). mechanize enforces robots.txt at the moment the request is made, so the handler has to be disabled before you open the page. It should be:

import mechanize
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4

br = mechanize.Browser()
br.set_handle_robots(False)  # must come before br.open(), or robots.txt is still enforced
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(web)
htmlcontent = page.read()
soup = BeautifulSoup(htmlcontent)
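
If you're not tied to mechanize, the same fetch also works with the requests library, which never consults robots.txt at all, so only the User-Agent header matters. A minimal sketch, assuming BeautifulSoup 4; the URL in web is a placeholder for the page from the question:

import requests
from bs4 import BeautifulSoup

web = 'http://example.com/news'  # placeholder for the target URL
headers = {'User-agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'}

response = requests.get(web, headers=headers)
response.raise_for_status()  # fail loudly on a 403/404 instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')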
Crypto