
I am trying to automate a task in Python using the mechanize module:

  1. Enter the keyword in a web form, submit the form.
  2. Look for a specific element in the response.

This works for a single keyword. But when I repeat the task for a list of keywords, I get HTTP Error 429 (Too Many Requests).

I tried the following workarounds:

  1. Adding custom headers (I noted them down for that specific website by using a proxy) so that the request looks like a legitimate browser request:

    import mechanize

    br = mechanize.Browser()
    # addheaders is a single list of (name, value) tuples;
    # assigning to it repeatedly would overwrite the previous headers.
    br.addheaders = [
        ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'),
        ('Connection', 'keep-alive'),
        ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
        ('Upgrade-Insecure-Requests', '1'),
        ('Accept-Encoding', 'gzip, deflate, sdch'),
        ('Accept-Language', 'en-US,en;q=0.8'),
    ]
    
  2. Since the blocked response came for every 5th request, I tried sleeping for 20 seconds after every 5 requests.
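The second attempt can be sketched like this; `search_keywords` and `do_request` are placeholder names for the loop and the form-submit-and-parse step:

```python
import time

def search_keywords(keywords, do_request, batch=5, pause=20):
    # After every `batch` requests, sleep `pause` seconds before continuing.
    results = []
    for i, kw in enumerate(keywords, 1):
        results.append(do_request(kw))
        if i % batch == 0:
            time.sleep(pause)
    return results
```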

Neither of the two methods worked.

sideshowbarker
  • just realised addheaders takes a single list of tuples: br.addheaders = [('user-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.3) Gecko/20100423 Ubuntu/10.04 (lucid) Firefox/3.6.3'), ('accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')] – CIph3r7r0ll Aug 16 '15 at 20:48
  • Related: http://stackoverflow.com/questions/7773567/web-scraper-limit-to-requests-per-minute-hour-on-single-domain <- this may show the actual request limit set forth by the site's administrator. – ivan_pozdeev Aug 16 '15 at 20:53
  • Getting a new error now: mechanize FormNotFoundError. On removing the headers I get the proper result, but then the limit problem again. – CIph3r7r0ll Aug 16 '15 at 21:01

1 Answer


You need to limit the rate of your requests to what the server's configuration permits. (Web Scraper: Limit to Requests Per Minute/Hour on Single Domain? may show the permitted rate.)

mechanize uses a heavily-patched fork of urllib2 (mechanize/_urllib2_fork.py) for network operations, and its Browser class is a descendant of that fork's OpenerDirector.

So, the simplest way to hook into its logic seems to be to add a handler to your Browser object:

  • with default_open and an appropriate handler_order to place it before all the others (lower means higher priority);
  • that stalls until the request is eligible, e.g. with a token bucket or leaky bucket algorithm, as implemented in Throttling with urllib2. Note that a bucket should probably be per-domain or per-IP;
  • and that finally returns None to pass the request on to the following handlers.
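A minimal token-bucket sketch of that idea follows. The bucket logic is self-contained; the mechanize wiring is shown in comments because the ThrottleHandler name, the rate value, and the exact hookup are assumptions, not tested code:

```python
import time

class TokenBucket(object):
    """Token-bucket rate limiter: refills `rate` tokens/second, holds up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.time, sleep=time.sleep):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.last = clock()
        self.clock = clock   # injectable for testing
        self.sleep = sleep

    def consume(self):
        # Refill based on elapsed time, then block until a token is available.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            self.sleep((1 - self.tokens) / self.rate)
            self.tokens = 1.0
        self.tokens -= 1.0

# Hypothetical wiring into mechanize (names per its urllib2-derived API):
#
# import mechanize
#
# class ThrottleHandler(mechanize.BaseHandler):
#     handler_order = 100                          # lower runs earlier
#     bucket = TokenBucket(rate=0.2, capacity=1)   # ~1 request per 5 s
#
#     def default_open(self, req):
#         self.bucket.consume()
#         return None        # let the normal handlers actually open the request
#
# br = mechanize.Browser()
# br.add_handler(ThrottleHandler())
```

With a per-domain dict of TokenBucket instances keyed by the request host, the same handler covers the "per-domain or per-IP" note above.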

Since this is a common need, you should probably publish your implementation as an installable package.

ivan_pozdeev