John Zwinck's advice in the question comments is pretty on the ball.
Part of the issue is that you have no control over the receiving server. When you throw an excessive number of processes at it, you force the server on the other end to figure out the right way to handle all of your requests at once, and your processes sit there idly waiting for it to get back to them. Since `pool.map()` only finishes when all of your processes finish (it is a blocking call), you wait for as long as it takes the server to service every one of them.
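For concreteness, the problematic pattern looks something like this - a minimal sketch, assuming a placeholder URL list and Python 3's `urllib.request` (the Python 2 equivalent is `urllib2.urlopen`):

```python
# A sketch of the problematic pattern: one process per URL, all blocking.
from multiprocessing import Pool
from urllib.request import urlopen  # urllib2.urlopen on Python 2

urls = ["http://example.com/page/%d" % i for i in range(100)]  # placeholder

def fetch(url):
    # Blocks this entire process until the server responds.
    return urlopen(url).read()

if __name__ == "__main__":
    with Pool(100) as pool:            # one process per request
        pages = pool.map(fetch, urls)  # blocks until *every* request finishes
```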
Everything now depends on the server.
The server can choose to dedicate its resources to serving all of your requests one by one - your requests are then effectively waiting in a queue, which offers no advantage over just sending them serially yourself. Single-threaded servers can be modelled like this, although their major speedup comes from the fact that they are asynchronous and jump rapidly between requests.
Other servers have a small number of processes or threads that spawn a large number of child threads, each handling incoming requests one by one - the Apache server, for instance, starts off with 2 dedicated processes of 25 threads each, so it can theoretically handle 50 concurrent requests and scale as high as it is configured to. It will service as many as it can at any given moment, and either put the remainder of your excess requests on hold or deny them service.
Some servers will simply kill or close connections if they threaten to overload the system or if an internal timeout is reached. The latter is more likely and more often encountered.
The other aspect is simply that your own CPU cores can't handle what you're asking of them. A core can handle one thread at a time - when we speak of parallelism, we really mean multiple cores handling threads simultaneously. A process with a large number of smaller threads can have those threads distributed among different CPU cores, and benefit accordingly.
But you have one hundred processes, each of which makes a blocking I/O call (`urlopen` is blocking). If that call is answered instantly, so far so good - if not, the process just sits there waiting on the server while the others contend for their slice of CPU time. You have successfully introduced waiting into a system where you explicitly want to avoid waiting. Compound this with the stress you put on the receiving server, and you get a pile of delays stemming from open connections.
## Solutions
There are quite a few solutions, but in my own opinion they all boil down to the same thing:
Avoid blocking calls. Use a solution that fires off a request, puts the thread responsible for it to sleep and off the scheduler's run queue, and wakes it up when an event is registered.
Use asynchronicity to your advantage. A single thread can make more than one request without blocking; you just have to be able to handle the responses intelligently as they come in one by one. You can even hand responses off to other threads that aren't doing any work (using a `Queue`, for example, as in the sketch after this list). The trick is to get them to work together seamlessly.
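As a rough illustration of that hand-off idea - a sketch with placeholder URLs; note that the fetching loop here still blocks, which is exactly what the real solutions below avoid:

```python
# Sketch: one thread fetches, another handles responses as they arrive.
import threading
from queue import Queue
from urllib.request import urlopen

urls = ["http://example.com/page/%d" % i for i in range(10)]  # placeholder
responses = Queue()

def handle_responses():
    # A thread that does no fetching of its own, just processes results.
    while True:
        body = responses.get()
        if body is None:        # sentinel: nothing more is coming
            return
        print("got %d bytes" % len(body))

handler = threading.Thread(target=handle_responses)
handler.start()

for url in urls:
    # Still a blocking fetch - the point here is only the Queue hand-off;
    # the libraries below replace this loop with a non-blocking one.
    responses.put(urlopen(url).read())

responses.put(None)
handler.join()
```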
`multiprocessing`, though a good solution for managing processes, does not bundle in the logic for coordinating HTTP requests with each process's appropriate behaviour. That is logic you would usually have to write yourself, and it could be done if you had greater control over how `urlopen` works - you'd have to find a way to make sure `urlopen` doesn't block, or at least subscribes to event notifications immediately after sending a request.
Certainly, this can all be done - but web scraping is a solved problem, and there's no need to reinvent the wheel.
Instead, there are a couple of options that are tried and tested:
`asyncio` is the standard as of Python 3.5. While not a full-fledged HTTP service, it offers asynchronous support for I/O-bound operations; you can make HTTP requests on top of it with `aiohttp`. Here's a tutorial on how to scrape with them.
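A minimal sketch of that combination, assuming Python 3.5+ and placeholder URLs:

```python
# All requests share one thread; the event loop switches between them
# whenever one is waiting on the network.
import asyncio
import aiohttp

urls = ["http://example.com/page/%d" % i for i in range(10)]  # placeholder

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
        print([len(page) for page in pages])

asyncio.run(main())  # on Python < 3.7: loop.run_until_complete(main())
```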
Scrapy is viable on both Python 2.7 and Python 3. It uses Twisted, `asyncio`'s non-standard forerunner and the go-to tool for fast network requests. I mention Scrapy instead of Twisted simply because Scrapy has already taken care of the underlying architecture for you [which can be read about here] - you should certainly explore Twisted to get a feel for the underlying system if you want to. It is the most hand-holdy of all the solutions I'll mention here, but also, in my experience, the most performant.
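To give a flavour of how little you have to write, here's a bare-bones spider sketch (the spider name and URLs are placeholders):

```python
# A bare-bones spider; Scrapy schedules the requests concurrently and
# calls parse() as each response arrives, in whatever order they complete.
import scrapy

class PageSpider(scrapy.Spider):
    name = "pages"
    start_urls = ["http://example.com/page/%d" % i for i in range(10)]

    def parse(self, response):
        yield {"url": response.url, "length": len(response.body)}
```

Save that as, say, `page_spider.py` and run it with `scrapy runspider page_spider.py` - concurrency limits, retries and throttling are all handled by the framework's settings rather than by code you write yourself.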
`grequests` is an extension of the popular `requests` library (which is incidentally superior to `urllib2` and should be used at every opportunity) to support so-called coroutines: lightweight threads that can be suspended and resumed at multiple points in their execution, ideal if you want the thread to do work while waiting for an I/O response. `grequests` builds on top of `gevent` (a coroutine library) to let you make multiple requests in a single thread and handle them at your own pace.
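A short sketch of what that looks like in practice, again with placeholder URLs:

```python
import grequests

urls = ["http://example.com/page/%d" % i for i in range(100)]  # placeholder

# Build the requests without sending them yet...
pending = (grequests.get(url) for url in urls)

# ...then send them on gevent greenlets. size caps how many are in
# flight at once, which also spares the receiving server.
for response in grequests.imap(pending, size=10):
    print(response.url, len(response.content))
```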