
I implemented a simple web scraper using Tornado. The main idea is to insert all URLs into a queue `q` and spawn multiple workers that ping each URL and check its status (most of the URLs don't exist, i.e. the requests time out).

All responses are inserted into another queue `q2`, but that's irrelevant here because that queue is only processed after all workers are done.

I also implemented the same approach using threads, with the same concurrency, and the thread implementation is much faster, even though the threads sit idle while waiting for responses from the web, whereas Tornado's IOLoop should be optimal for exactly this kind of workload.

What am I missing? Thanks in advance.

from tornado import httpclient, gen, ioloop, queues

concurrency = 100

@gen.coroutine
def get_response(url):
    response = yield httpclient.AsyncHTTPClient().fetch(url, raise_error=False)
    return response


@gen.coroutine
def main():
    q = queues.Queue()
    q2 = queues.Queue()

    @gen.coroutine
    def fetch_url():
        url = yield q.get()
        try:
            response = yield get_response(url)
            q2.put((url, response.code))
        finally:
            q.task_done()

    @gen.coroutine
    def worker():
        while True:
            yield fetch_url()

    for url in urls:  # `urls` is the list of URLs to probe, defined elsewhere
        q.put(url)

    print("all tasks were sent...")

    # Start workers, then wait for the work queue to be empty.
    for _ in range(concurrency):
        worker()

    print("workers spawned")

    yield q.join()
    
    print("done")


if __name__ == '__main__':
    io_loop = ioloop.IOLoop.current()
    io_loop.run_sync(main)

The thread implementation is simple (no multiprocessing) and uses the following code:

for i in range(concurrency):
    t = threading.Thread(target=worker, args=())
    t.setDaemon(True)
    t.start()
shahaf
  • You should check [this](https://stackoverflow.com/questions/1050222/what-is-the-difference-between-concurrency-and-parallelism), especially if you're running your code on a multi-core machine. – yorodm May 12 '18 at 15:45
  • @yorodm yeah, I know... but as I understand it, the GIL prevents my Python threads from running in "true" parallel – shahaf May 12 '18 at 15:59
  • you'll get better performance if you try multiprocessing – Skam May 12 '18 at 16:04
  • @Skam I know, but my question is why the simple thread version gives better performance than my IOLoop implementation – shahaf May 12 '18 at 16:06
  • *`"thread implementation is much faster"`* - how much faster? I tested your code but both methods seem to perform equally. I did testing on local server with some time-out urls. – xyres May 12 '18 at 17:38
  • @xyres thanks so much for trying to reproduce the issue. I didn't want to get into numbers because they depend on the system. I'm probing around 50K URLs; the thread implementation takes about 40 min, the Tornado IOLoop version about 2 hr – shahaf May 14 '18 at 08:32

1 Answer


There are several reasons why this might be slower:

  1. The goal of asynchronous programming is not speed, it is scalability. The asynchronous implementation should perform better at high levels of concurrency (in particular, it will use much less memory), but at low levels of concurrency there may not be a difference or threads may be faster.

  2. Tornado's default HTTP client is written in pure Python and is missing some features that are important for performance. In particular, it is unable to reuse connections. If the performance of HTTP client requests is important to you, use the libcurl-based client instead:

    tornado.httpclient.AsyncHTTPClient.configure('tornado.curl_httpclient.CurlAsyncHTTPClient')
    
  3. Sometimes DNS resolution is blocking even in an otherwise-asynchronous HTTP client, which can limit effective concurrency. This was true of Tornado's default HTTP client until Tornado 5.0. For the curl-based client, it depends on how libcurl was built: you need a version of libcurl that was built with the c-ares library. Last time I looked, this was not done by default on most Linux distributions.
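Putting points 2 and 3 together, a hedged configuration sketch (note: `max_clients` caps in-flight requests and defaults to 10, well below the question's 100 workers; raising it is my suggestion, not part of the original answer):

```python
from tornado.httpclient import AsyncHTTPClient

# Switch to the libcurl-based client; max_clients caps concurrent
# requests (the default of 10 is far below 100 workers).
AsyncHTTPClient.configure(
    'tornado.curl_httpclient.CurlAsyncHTTPClient',
    max_clients=100,
)

# libcurl only resolves DNS asynchronously when built against c-ares;
# pycurl's version string reports it if present (e.g. "... c-ares/1.14.0"):
import pycurl
print(pycurl.version)
```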

Ben Darnell
  • thanks, I'm not sure this will help, but my suspicion was that DNS was causing a bit of trouble, although I couldn't explain it the way you did. I'll check it and hope it solves the issue; in the meantime I'll mark your answer – shahaf May 14 '18 at 08:35