
I am trying to automate a task in Python using the mechanize module:

  1. Enter the keyword in a web form, submit the form.
  2. Look for a specific element in the response.

This works for a single keyword. But when I repeat the task for a list of keywords, I get HTTP Error 429 (Too Many Requests).

I tried the following workarounds:

  1. Adding custom headers (I noted them down for that specific website by using a proxy) so that the request looks like a legitimate browser request:

    import mechanize

    br = mechanize.Browser()
    # addheaders is a single list of (name, value) tuples;
    # assigning to it repeatedly would overwrite the previous headers.
    br.addheaders = [
        ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'),
        ('Connection', 'keep-alive'),
        ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
        ('Upgrade-Insecure-Requests', '1'),
        ('Accept-Encoding', 'gzip, deflate, sdch'),
        ('Accept-Language', 'en-US,en;q=0.8'),
    ]
    
  2. Since the blocked response came for every 5th request, I tried sleeping for 20 seconds after every 5 requests.
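The second attempt can be sketched like this; `search_keywords` and `do_request` are placeholder names for the loop and the form-submit-and-parse step:

```python
import time

def search_keywords(keywords, do_request, batch=5, pause=20):
    # After every `batch` requests, sleep `pause` seconds before continuing.
    results = []
    for i, kw in enumerate(keywords, 1):
        results.append(do_request(kw))
        if i % batch == 0:
            time.sleep(pause)
    return results
```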

Neither of the two methods worked.

sideshowbarker
  • just realised addheaders takes a single list of tuples: br.addheaders = [('user-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.3) Gecko/20100423 Ubuntu/10.04 (lucid) Firefox/3.6.3'), ('accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')] – CIph3r7r0ll Aug 16 '15 at 20:48
  • Related: http://stackoverflow.com/questions/7773567/web-scraper-limit-to-requests-per-minute-hour-on-single-domain <- this may show the actual request limit set forth by the site's administrator. – ivan_pozdeev Aug 16 '15 at 20:53
  • Getting a new error now: mechanize FormNotFoundError. On removing the headers I get the proper result, but then the limit problem again. – CIph3r7r0ll Aug 16 '15 at 21:01

1 Answer


You need to limit the rate of your requests to what the server's configuration permits. (Web Scraper: Limit to Requests Per Minute/Hour on Single Domain? may show the permitted rate.)

mechanize uses a heavily-patched fork of urllib2 (mechanize/_urllib2_fork.py) for network operations, and its Browser class is a descendant of that fork's OpenerDirector.

So, the simplest way to hook into its logic seems to be to add a handler to your Browser object:

  • with default_open and an appropriate handler_order to place it before all the others (lower means higher priority);
  • that stalls until the request is eligible, e.g. with a token bucket or leaky bucket algorithm, as implemented in Throttling with urllib2. Note that a bucket should probably be per-domain or per-IP;
  • and that finally returns None to pass the request on to the following handlers.
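A minimal token-bucket sketch of that idea follows. The bucket logic is self-contained; the mechanize wiring is shown in comments because the ThrottleHandler name, the rate value, and the exact hookup are assumptions, not tested code:

```python
import time

class TokenBucket(object):
    """Token-bucket rate limiter: refills `rate` tokens/second, holds up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.time, sleep=time.sleep):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.last = clock()
        self.clock = clock   # injectable for testing
        self.sleep = sleep

    def consume(self):
        # Refill based on elapsed time, then block until a token is available.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            self.sleep((1 - self.tokens) / self.rate)
            self.tokens = 1.0
        self.tokens -= 1.0

# Hypothetical wiring into mechanize (names per its urllib2-derived API):
#
# import mechanize
#
# class ThrottleHandler(mechanize.BaseHandler):
#     handler_order = 100                          # lower runs earlier
#     bucket = TokenBucket(rate=0.2, capacity=1)   # ~1 request per 5 s
#
#     def default_open(self, req):
#         self.bucket.consume()
#         return None        # let the normal handlers actually open the request
#
# br = mechanize.Browser()
# br.add_handler(ThrottleHandler())
```

With a per-domain dict of TokenBucket instances keyed by the request host, the same handler covers the "per-domain or per-IP" note above.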

Since this is a common need, you should probably publish your implementation as an installable package.

ivan_pozdeev