
I've been playing with multiprocessing, and my code works when looking at smaller numbers, but when I want to run a larger sample two things happen: either the code locks up or I get the following error message: "urlopen error [Errno 10054] An existing connection was forcibly closed by the remote host". I can't figure out how to get it to work. Thanks.

    from multiprocessing import Pool, cpu_count
    import urllib2
    from bs4 import BeautifulSoup
    import json
    import timeit
    import socket
    import errno

    def parseWeb(id):
        url = 'https://carhood.com.au/rent/car_detail/' + str(id) + '/'
        hdr = {'Accept': 'text/html,application/xhtml+xml,*/*',
               'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'}

        # send the headers with the request (hdr was previously defined but never used)
        html = urllib2.urlopen(urllib2.Request(url, headers=hdr)).read()
        soup = BeautifulSoup(html, "lxml")
        car = soup.find("h1", {"class": "intro-title intro-title-tertiary"}).text

        return car

    if __name__ == '__main__':
        start = timeit.default_timer()
        pool = Pool(cpu_count() * 100)
        # This works for me with xrange(1, 70)
        results = pool.map(parseWeb, xrange(1, 400))
        print results
        # I've tried this as a solution but it didn't work:
        #     startNum = 1
        #     endNum = 470
        #     for x in range(startNum, endNum, 70):
        #         print x
        #         results = pool.map(parseWeb, xrange(startNum, x))
        #         print results
        #         startNum = x
        stop = timeit.default_timer()
        print stop - start
FancyDolphin
  • Is it really reasonable to spawn 100 processes per CPU core? I don't think it is. It's also very mean to the website owner. – John Zwinck Apr 24 '16 at 04:59
  • I suppose two things I should caveat with: 1) I'm just trying to learn how to do it, so I'm happy to hear what a more reasonable process number is - hence why I'm asking - and 2) that's also why I'm asking how to loop through it more correctly. I'd imagine if there's a way to place an explicit wait or something to that effect I could be more considerate; I just haven't figured out how to do it. – FancyDolphin Apr 24 '16 at 05:05
  • OK, here's my advice: I think more than 3 or 4 simultaneous requests to one domain is excessive and should be avoided. If possible, try to find a resource on the domain that lets you download more data with a single request. Do not spawn hundreds of processes. – John Zwinck Apr 24 '16 at 05:19
  • And you should really check if a site has a [/robots.txt](http://www.robotstxt.org/robotstxt.html) file and respect its contents before you go crawling all over it. – Roland Smith Apr 24 '16 at 10:28

1 Answer


John Zwinck's advice in the question comments is right on the ball.

Part of the issue is that you have no control over the receiving server. When you fire off an excessive number of requests at once, you force the server on the other end to figure out how to handle them all simultaneously, and your processes sit there idle, waiting for the server to get back to them at some point. Since pool.map() is a blocking call that only returns when every one of your processes has finished, you end up waiting for however long it takes the server to service each of them.
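
To see that blocking behaviour in isolation, here is a minimal sketch (the numbers are illustrative, and the sleep merely stands in for a server that is slow to answer):

    from multiprocessing import Pool
    import time

    def slow(x):
        time.sleep(1)                         # stand-in for a slow server response
        return x

    if __name__ == '__main__':
        pool = Pool(4)
        start = time.time()
        results = pool.map(slow, range(8))    # returns only once every task is done
        print results, time.time() - start    # roughly 2 seconds: two waves of 4 workers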

Everything now depends on the server.

  • The server can choose to dedicate its resources to serving all of your requests one by one - which effectively means your requests now sit in a queue, offering no advantage over sending them serially yourself. Single-threaded servers can be modelled like this, although their major speedup comes from the fact that they are asynchronous and jump rapidly from one request to the next.

  • Some servers have a small number of master processes that spawn a larger number of worker threads, each handling incoming requests one at a time - the Apache server, for instance, starts off with 2 dedicated processes with 25 threads each, so in theory it can handle 50 concurrent requests and scale as high as it is configured to. It will service as many as it can at any given moment and either put the remainder of your excess requests on hold or deny them service.

  • Some servers will simply kill or close connections if they threaten to overload the system or if an internal timeout is reached. The latter is the more likely and more frequently encountered case (one way to cope with such dropped connections is sketched just below).
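
As a concrete illustration of that last point, here is a minimal sketch (Python 2, using the question's urllib2; the timeout and retry count are arbitrary choices of mine) that stops a dropped connection such as Errno 10054 from hanging a worker indefinitely:

    import socket
    import time
    import urllib2

    def fetch_with_retry(url, tries=3, timeout=10):
        # give up after `tries` attempts instead of blocking forever on a bad connection
        for attempt in range(tries):
            try:
                return urllib2.urlopen(url, timeout=timeout).read()
            except (urllib2.URLError, socket.error):
                time.sleep(2 ** attempt)      # back off before asking the server again
        return None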

The other aspect of it is simply that your own CPU cores can't handle what you're asking of them. A core can handle one thread at a time - when we speak of parallelism, we really mean multiple cores each handling a thread simultaneously. A process with a large number of smaller threads can have those threads distributed among different CPU cores, which is where the benefit comes from.

But you have a hundred processes per core, each of which makes a blocking I/O call (urlopen is blocking). If that call is answered instantly, so far so good - if not, that process just sits there waiting and holds on to its pool worker, and pool.map() cannot return until it finishes. You have successfully introduced waiting into a system where you explicitly wanted to avoid it. Compound that with the stress you place on the receiving server, and you end up with a pile of delays stemming from open connections.

Solutions

There are quite a few solutions, but in my own opinion they all boil down to the same thing:

  • Avoid blocking calls. Use a solution that fires off a request, puts the thread responsible for it to sleep (taking it off the scheduler's run queue), and wakes it up only when an event is registered.

  • Use asynchronicity to your advantage. A single thread can make more than one request without blocking; you just have to handle the responses intelligently as they come in, one by one. You can even pass responses off to other threads that aren't doing any work, for example by using a Queue (as sketched just below). The trick is to get them to work together seamlessly.
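
Here is a rough sketch of that hand-off using only the Python 2 standard library; the thread count, timeout, and URL range are illustrative choices of mine, and four fetchers also keeps the request rate close to what the comments above recommend:

    import threading
    import urllib2
    from Queue import Queue

    url_q = Queue()
    html_q = Queue()

    def fetcher():
        # each fetcher handles one URL at a time, so 4 threads = at most 4 open requests
        while True:
            url = url_q.get()
            try:
                html_q.put(urllib2.urlopen(url, timeout=10).read())
            except urllib2.URLError:
                html_q.put(None)              # report failures too, so the consumer's count stays right
            url_q.task_done()

    for _ in range(4):
        t = threading.Thread(target=fetcher)
        t.daemon = True                       # don't keep the process alive once the main thread is done
        t.start()

    urls = ['https://carhood.com.au/rent/car_detail/%d/' % i for i in xrange(1, 70)]
    for u in urls:
        url_q.put(u)

    for _ in urls:                            # consume responses as they arrive
        page = html_q.get()
        # ... hand `page` to BeautifulSoup here ...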

multiprocessing, though a good solution for handling processes, is not a bundled-in solution for handling the interaction between HTTP requests and the process's appropriate behaviour. This is logic you would usually have to write yourself, and it could be done if you had greater control over how urlopen works - you'd have to figure out a way to make sure urlopen doesn't block, or at least is willing to subscribe to event notifications immediately after sending a request.

Certainly, this can all be done - but web scraping is a solved problem, and there's no need to reinvent the wheel.

Instead, there are a couple of options that are tried and tested:

  • asyncio is the standard as of Python 3.5. While not a full-fledged HTTP library itself, it offers asynchronous support for I/O-bound operations, and you can make HTTP requests on top of it using aiohttp (a short sketch using aiohttp appears after this list).

  • Scrapy is viable on Python 2.7 and Python 3. It uses Twisted, asyncio's non-standard forerunner and the go-to tool for fast network requests. I mention Scrapy instead of Twisted simply because Scrapy has already taken care of the underlying architecture for you - you should certainly explore Twisted to get a feel for the underlying system if you want to. It is the most hand-holdy of all the solutions I'll mention here, but also, in my experience, the most performant.

  • grequests is an extension of the popular requests library (which is incidentally superior to urllib2 and should be used at every opportunity) to support so-called coroutines: threads that can be suspended and resumed at multiple points in their execution - ideal when you want a thread to keep doing work while it waits for an I/O response. grequests builds on top of gevent (a coroutine library) to let you make multiple requests in a single thread and handle them at your own pace.
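
To make the last two options concrete, here are two minimal sketches. Neither is a drop-in answer to the question: the URL pattern comes from the question, but the concurrency cap of 4 in-flight requests is simply a polite guess in line with the comments. First, grequests:

    import grequests

    urls = ['https://carhood.com.au/rent/car_detail/%d/' % i for i in range(1, 400)]
    reqs = (grequests.get(u) for u in urls)
    for resp in grequests.imap(reqs, size=4):     # size caps how many requests are in flight
        print resp.status_code, resp.url

And the same idea with asyncio plus aiohttp (Python 3.5+):

    import asyncio
    import aiohttp

    async def fetch(session, sem, car_id):
        url = 'https://carhood.com.au/rent/car_detail/{}/'.format(car_id)
        async with sem:                           # the semaphore caps concurrent requests
            async with session.get(url) as resp:
                return await resp.text()

    async def main():
        sem = asyncio.Semaphore(4)
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, sem, i) for i in range(1, 400)))

    pages = asyncio.get_event_loop().run_until_complete(main())
    print(len(pages))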

Akshat Mahajan