
So, I created a Django website to web-scrape news pages for articles. Even though I use mechanize, the sites still tell me:

HTTP Error 403: request disallowed by robots.txt 

I tried everything. Look at my code (just the scraping part):

br = mechanize.Browser()
page = br.open(web)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    #BeautifulSoup 
htmlcontent = page.read()
soup = BeautifulSoup(htmlcontent)

I also tried calling br.open() before set_handle_robots(False), and so on. It didn't work either.

Is there any way to get through to these sites?

Julian Slonim
  • They are disallowed because those sites don't want any bot to access their resources. There might be legal terms. You should stay away from them. – Bibhas Debnath Sep 16 '13 at 06:33

1 Answer


You're calling br.set_handle_robots(False) after br.open(). mechanize enforces robots.txt at the moment the request is made, so the handler has to be disabled before you open the page. It should be:

import mechanize
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4

br = mechanize.Browser()
br.set_handle_robots(False)  # must come before br.open(), or robots.txt is still enforced
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(web)
htmlcontent = page.read()
soup = BeautifulSoup(htmlcontent)
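
If you're not tied to mechanize, the same fetch also works with the requests library, which never consults robots.txt at all, so only the User-Agent header matters. A minimal sketch, assuming BeautifulSoup 4; the URL in web is a placeholder for the page from the question:

import requests
from bs4 import BeautifulSoup

web = 'http://example.com/news'  # placeholder for the target URL
headers = {'User-agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'}

response = requests.get(web, headers=headers)
response.raise_for_status()  # fail loudly on a 403/404 instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')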
Crypto