I implemented a simple web scraper using Tornado. The main idea is to insert all URLs into a queue q and spawn multiple workers that ping each URL and check its status (most of the URLs don't exist, i.e. the requests time out). All responses are put into another queue q2, but that part is irrelevant here because that queue is only processed after all the workers are done.

I also implemented the same approach with threads, using the same concurrency level, and the thread implementation is much faster, even though the threads sit idle while waiting for responses from the web, whereas Tornado's IOLoop should be optimal for exactly this kind of I/O-bound workload.

What am I missing? Thanks in advance.

Here is the Tornado version:
    from tornado import httpclient, gen, ioloop, queues

    concurrency = 100


    @gen.coroutine
    def get_response(url):
        response = yield httpclient.AsyncHTTPClient().fetch(url, raise_error=False)
        return response


    @gen.coroutine
    def main():
        q = queues.Queue()
        q2 = queues.Queue()

        @gen.coroutine
        def fetch_url():
            url = yield q.get()
            try:
                response = yield get_response(url)
                q2.put((url, response.code))  # q2 is unbounded, so put() completes immediately
            finally:
                q.task_done()

        @gen.coroutine
        def worker():
            while True:
                yield fetch_url()

        # urls is defined elsewhere as a list of URL strings
        for url in urls:
            q.put(url)
        print("all tasks were sent...")

        # Start workers (fire-and-forget coroutines), then wait for the work queue to be empty.
        for _ in range(concurrency):
            worker()
        print("workers spawned")

        yield q.join()
        print("done")


    if __name__ == '__main__':
        io_loop = ioloop.IOLoop.current()
        io_loop.run_sync(main)
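Both versions are timed the same way, roughly like this (a sketch that reuses main from above; the urls list here is illustrative only):

    import time

    urls = ["http://host-%d.example.com/" % i for i in range(1000)]  # illustrative only

    start = time.time()
    ioloop.IOLoop.current().run_sync(main)
    print("tornado version: %.1f seconds" % (time.time() - start))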
The thread implementation is simple (no multiprocessing) and spawns its workers with the following code:
    import threading

    for i in range(concurrency):
        t = threading.Thread(target=worker, args=())
        t.setDaemon(True)
        t.start()
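The worker itself is just the blocking equivalent of fetch_url, along these lines (a minimal sketch; I'm using the requests library and the standard queue module here, with an arbitrary timeout):

    import queue

    import requests

    q = queue.Queue()   # filled with the same urls
    q2 = queue.Queue()  # collects (url, status_code) pairs

    def worker():
        while True:
            url = q.get()
            try:
                # requests blocks this thread until the response arrives or the timeout fires
                response = requests.get(url, timeout=20)
                q2.put((url, response.status_code))
            except requests.RequestException:
                q2.put((url, None))  # treat timeouts/connection errors as "no status"
            finally:
                q.task_done()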