I've been trying to build a scraper with multithreading functionality for the past two days, and I still haven't managed it. At first I tried a regular multithreaded approach with the threading module, but it wasn't faster than using a single thread. Later I learned that requests is blocking and that a multithreaded approach wasn't really working. So I kept researching and found out about grequests and gevent. Now I'm running tests with gevent and it's still not faster than using a single thread. Is my coding wrong?

Here is the relevant part of my class:

import gevent.monkey
gevent.monkey.patch_all()  # patch the standard library before requests is imported

from gevent.pool import Pool
import requests

class Test:
    def __init__(self):
        self.session = requests.Session()
        self.pool = Pool(20)
        self.urls = [...urls...]

    def fetch(self, url):
        try:
            response = self.session.get(url, headers=self.headers)  # headers defined elsewhere in the class
        except Exception:
            self.logger.error('Problem fetching %s', url, exc_info=True)
            return  # response would be unbound below if the request failed

        self.doSomething(response)

    def run(self):  # 'async' is a reserved word in Python 3, so use another name
        for url in self.urls:
            self.pool.spawn(self.fetch, url)

        self.pool.join()

test = Test()
test.run()
  • Where are your imports? Also have you tried the `multiprocessing` module? – Will Jul 09 '16 at 09:04
  • I've added the imports. Sorry, I didn't think it would be necessary. I haven't tried multiprocessing but why wouldn't gevent work? – krypt Jul 09 '16 at 09:14
  • No problem! Try changing `gevent.monkey.patch_all()` to `gevent.monkey.patch_all(httplib=True)`. If that helps I'll explain it. – Will Jul 09 '16 at 09:21
  • It doesn't seem to be supported anymore. ValueError: gevent.httplib is no longer provided, httplib must be False – krypt Jul 09 '16 at 09:27
  • You're right! `grequests` is the new way; see my answer. – Will Jul 09 '16 at 09:39
  • I would question your assumption that "requests is blocking and multithreading approach isn't really working". While it is true that requests is blocking, it only blocks the thread in which it is running. (Python's GIL doesn't pose a problem here because it is internally released on blocking network calls, which is relevant to `requests`.) There should be no problem running `requests` in multiple threads; see the thread-pool sketch after these comments. – user4815162342 Jul 09 '16 at 09:39
  • @user4815162342 You may be right, as I'm not very well informed on these topics. I assumed that because I saw no improvement in speed with multithreading over a single-threaded approach. – krypt Jul 09 '16 at 09:42
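
As the comment above notes, `requests` works fine from multiple threads because the GIL is released while a socket waits. A minimal thread-pool sketch with `concurrent.futures` (the URLs, worker count, and timeout are illustrative, not from the question):

import concurrent.futures
import requests

urls = ['http://www.example.com', 'http://www.google.com']  # illustrative

def fetch(url):
    # Blocks only this worker thread; the GIL is released during the
    # network wait, so the threads overlap their I/O.
    return requests.get(url, timeout=10)

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    for response in executor.map(fetch, urls):
        print(response.status_code, response.url)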

1 Answer

Install the grequests module, which works with gevent (requests is not designed for async):

pip install grequests

Then change the code to something like this:

import grequests

class Test:
    def __init__(self):
        self.urls = [
            'http://www.example.com',
            'http://www.google.com',
            'http://www.yahoo.com',
            'http://www.stackoverflow.com/',
            'http://www.reddit.com/'
        ]

    def exception(self, request, exception):
        print("Problem: {}: {}".format(request.url, exception))

    def run(self):  # 'async' is a reserved word in Python 3
        # Build the requests lazily, then send them with at most 5 in
        # flight at a time; map() returns once all have finished.
        reqs = (grequests.get(u) for u in self.urls)
        results = grequests.map(reqs, exception_handler=self.exception, size=5)
        print(results)

test = Test()
test.run()

This is officially recommended by the requests project:

Blocking Or Non-Blocking?

With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.

If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Two excellent examples are grequests and requests-futures.
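
For reference, the streaming mode mentioned in that quote looks roughly like this (the URL and chunk size are illustrative); each read still blocks, but the body arrives in pieces:

import requests

# stream=True defers downloading the body until it is read
response = requests.get('http://www.example.com', stream=True)
for chunk in response.iter_content(chunk_size=8192):
    handle(chunk)  # handle() is a hypothetical per-chunk callback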

Using this method gives me a noticeable performance increase with 10 URLs: 0.877s vs 3.852s with your original method.
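
requests-futures, the other project named in the quote above, runs each request on a thread pool behind the familiar Session API; a minimal sketch (URLs and worker count are illustrative):

from requests_futures.sessions import FuturesSession

urls = ['http://www.example.com', 'http://www.google.com']  # illustrative
session = FuturesSession(max_workers=5)

# get() returns a Future immediately instead of a Response
futures = [session.get(u) for u in urls]
for future in futures:
    response = future.result()  # blocks until that request completes
    print(response.status_code, response.url)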

  • In my testing environment I'm sending requests to 88 URLs on the same domain. It takes ~60 seconds to complete using a single process. Using gevent it is still ~60 seconds. Unfortunately, using grequests it is still ~60 seconds. Much slower than 0.8s per 10 URLs. Can this be caused by a limitation of the target server? – krypt Jul 09 '16 at 09:44
  • That's totally possible. Could you show your new `grequests` code in an edit below your original post? Test the server using `ab -c 10 -n 100 <url>` (ApacheBench). My test URLs were all major, efficient sites. You can also add a `size=5` or `size=10` to `map()` to limit the number of concurrent requests, which could increase performance. – Will Jul 09 '16 at 09:52
  • I noticed I'm getting much better performance with lower numbers of `size`. – Will Jul 09 '16 at 09:54
  • Yep, I think the culprit is the server. The benchmark takes around 13 seconds, in contrast to ~1 sec for Google. I had no idea ApacheBench existed. I also found out that I'm getting roughly the same time using the URLs you've provided in the example. Your solution with grequests definitely works. I shouldn't have forgotten that scraping is dependent on the scraped. Thanks for your time. – krypt Jul 09 '16 at 10:05
  • Ah, that makes sense. No problem at all, glad to help :) By tweaking `size`, you should still be able to get better performance than `requests` alone. – Will Jul 09 '16 at 10:09
  • Went down to ~30 seconds with `size=2`, as you had predicted :) – krypt Jul 09 '16 at 10:18
  • Awesome, that helps at least! :) – Will Jul 09 '16 at 10:21
  • If you're developing a GUI application that requires non-blocking HTTP requests, then grequests is not for you. It actually waits for all requests to finish before continuing with the next instructions (join-blocking). Read more about the issue here: https://github.com/kennethreitz/grequests/issues/82 – Mohd Shahril Nov 05 '16 at 12:44
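
For what it's worth, `grequests` also provides `imap()`, which yields each response as it completes instead of returning one list at the end. The loop still blocks the calling thread, so it doesn't solve the GUI case above, but it does let you process results incrementally. A minimal sketch (URLs and `size` are illustrative):

import grequests

urls = ['http://www.example.com', 'http://www.google.com']  # illustrative

# imap() yields responses in completion order, rather than
# collecting them all the way map() does.
for response in grequests.imap((grequests.get(u) for u in urls), size=2):
    print(response.status_code, response.url)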