
Possible Duplicate:
Multiple (asynchronous) connections with urllib2 or other http library?

I am working on a Linux web server that runs Python code to grab real-time data over HTTP from a 3rd-party API. The data is put into a MySQL database. I need to make a lot of queries to a lot of URLs, and I need to do it fast (faster = better). Currently I'm using urllib3 as my HTTP library. What is the best way to go about this? Should I spawn multiple threads (if so, how many?) and have each one query a different URL? I would love to hear your thoughts about this - thanks!

user1094786
  • There is a new answer that I can't add because this question was closed. The best way to do this today is using requests-futures https://github.com/ross/requests-futures – Chris Broski Jun 22 '18 at 18:05
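
  For reference, a minimal sketch of the requests-futures approach mentioned in the comment above. FuturesSession comes from the requests-futures package; the worker count and URLs are illustrative, not recommendations.

  from concurrent.futures import as_completed
  from requests_futures.sessions import FuturesSession

  urls = ['http://httpbin.org/get', 'http://python-requests.org']

  session = FuturesSession(max_workers=10)  # size of the underlying thread pool
  futures = [session.get(u) for u in urls]  # each .get() returns a Future immediately
  for future in as_completed(futures):
      response = future.result()            # a regular requests.Response
      print(response.url, response.status_code)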

3 Answers


If "a lot" is really a lot, then you probably want to use asynchronous I/O rather than threads.

requests + gevent = grequests

GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.

import grequests

urls = [
    'http://www.heroku.com',
    'http://tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

rs = (grequests.get(u) for u in urls)  # build the requests without sending them yet
grequests.map(rs)  # send them all concurrently and return the list of responses
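
The comments below raise two practical concerns: throttling (too many sockets open at once) and error handling. grequests.map accepts a size argument to cap concurrency and an exception_handler callback for failed requests; the timeout, pool size, and handler here are an illustrative sketch, not part of the original answer.

import grequests

def on_error(request, exception):
    print('request failed:', request.url, exception)

rs = (grequests.get(u, timeout=5) for u in urls)                     # per-request timeout
responses = grequests.map(rs, size=20, exception_handler=on_error)   # at most 20 in flight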
Piotr Dobrogost
  • I want to use this method for sending requests to about 50,000 URLs. Is it a good strategy? Also, what about exceptions like timeouts, etc.? – John Nov 26 '12 at 14:27
  • @John Yes, it is. As to exceptions see [`safe_mode`](http://requests.readthedocs.org/en/latest/api/) parameter and issue [953](https://github.com/kennethreitz/requests/pull/953) – Piotr Dobrogost Nov 26 '12 at 17:32
  • I can't send more than 30 requests using grequests. When I do, I get "Max retries exceeded with url: ..., Too many open files". Is there any way to fix this problem? – AliBZ Aug 07 '13 at 23:44
  • Word of warning: grequests seems to be abandoned, and does not have error handling. My personal recommendation is https://github.com/ross/requests-futures , which is equally fast and, with backports, also works on 2.7. – Pedro Jun 05 '14 at 20:47
  • @droope it doesn't look like grequests is abandoned, and it seems easier to run on `python_ver < 3.4`. Do you have a link to the backports package you're talking about? This is the most popular package I see: https://pypi.python.org/pypi/backports.ssl_match_hostname – Ehtesh Choudhury Apr 02 '16 at 21:10
  • Why is asynchronous io better than threads to handle a lot of requests? Wouldn't python threads yield control whenever they call C extensions, thus ensuring that no time is wasted waiting on network I/O? Seems like it's no different from what async programming would achieve, but I'm probably missing some important consideration. – max Sep 16 '16 at 20:43
  • @max, the important consideration you are missing is that threads have a high amount of overhead since each thread has its own dedicated stack space. – PolyTekPatrick Nov 22 '16 at 13:45

You should use multithreading as well as pipelining the requests, for example search -> details -> save.

The number of threads you can use doesn't depend only on your own hardware. How many requests can the service serve? How many concurrent requests does it allow? Even your bandwidth can be a bottleneck.

If you're doing any kind of scraping, the service could block you after a certain number of requests, so you may need to use proxies or multiple IP bindings.

In my experience, in most cases I can run 50-300 concurrent requests from Python scripts on my laptop.
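
A minimal sketch of such a staged pipeline, assuming hypothetical search(), get_details(), and save_to_db() helpers standing in for the 3rd-party API calls and the MySQL insert; search_terms and the worker count are illustrative and should be tuned against the service's limits.

import queue
import threading

work_q = queue.Queue()

def worker():
    while True:
        term = work_q.get()
        try:
            listing = search(term)          # stage 1: search request
            details = get_details(listing)  # stage 2: details request
            save_to_db(details)             # stage 3: MySQL insert
        finally:
            work_q.task_done()

for _ in range(50):  # number of concurrent workers; experiment to find the upper limit
    threading.Thread(target=worker, daemon=True).start()

for term in search_terms:
    work_q.put(term)
work_q.join()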

Maksym Polshcha
  • Agree with Polshcha here. Most of the time, when you're making HTTP requests to an arbitrary service, most of the (clock) time is spent waiting for the network and the remote service to respond. So, within reason, the more threads the better, since at any given moment most of those threads will just be sitting in wait queues. Definitely heed Polshcha's notes on service throttling. – parselmouth May 11 '12 at 16:54
  • Thanks guys - the service is commercial and we are paying for it. It is very fast and will not be the bottleneck. In this case, what would be the best option? – user1094786 May 11 '12 at 17:03
  • @user1094786 In this case, just try to build a pipeline of requests and experiment with the number of threads at each stage. Keep trying - sooner or later you'll find the upper limit :-) – Maksym Polshcha May 12 '12 at 12:13

Sounds like an excellent application for Twisted. Here are some web-related examples, including how to download a web page. Here is a related question on database connections with Twisted.

Note that Twisted does not rely on threads for doing multiple things at once. Rather, it takes a cooperative multitasking approach: your main script starts the reactor and the reactor calls functions that you set up. Your functions must return control to the reactor before the reactor can continue working.
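
A minimal sketch of that reactor-driven style, using getPage from older Twisted releases (it has since been deprecated in favour of twisted.web.client.Agent); the URLs and the callback are illustrative.

from twisted.internet import reactor, defer
from twisted.web.client import getPage

urls = ['http://httpbin.org/get', 'http://python-requests.org']

def on_page(body, url):
    print(url, len(body))  # called by the reactor when the page has downloaded

deferreds = [getPage(u).addCallback(on_page, u) for u in urls]
defer.DeferredList(deferreds).addBoth(lambda _: reactor.stop())  # stop once all finish
reactor.run()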

jrennie