I'm trying to write a web crawler and want to make HTTP requests as quickly as possible. tornado's AsyncHTTPClient seems like a good choice, but all the example code I've seen (e.g. https://stackoverflow.com/a/25549675/1650177) basically calls AsyncHTTPClient.fetch on a huge list of URLs to let tornado queue them up and eventually make the requests.
But what if I want to process an indefinitely long (or just a really big) list of URLs from a file or the network? I don't want to load all the URLs into memory.
I've Googled around but can't seem to find a way to feed AsyncHTTPClient.fetch from an iterator. I did, however, find a way to do what I want using gevent: http://gevent.org/gevent.threadpool.html#gevent.threadpool.ThreadPool.imap. Is there a way to do something similar in tornado?
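For reference, this is roughly what the gevent version looks like. The `fetch` helper, `urls_from_file`, and `urls.txt` are just placeholders of mine, and I'm assuming `requests` for the blocking HTTP calls:

```python
from gevent.threadpool import ThreadPool
import requests  # assumption: using requests for the blocking fetches

def fetch(url):
    # Blocking fetch; the thread pool provides the concurrency.
    return requests.get(url)

def urls_from_file(path):
    # Lazily yield URLs one line at a time so the whole list
    # never has to sit in memory.
    with open(path) as f:
        for line in f:
            yield line.strip()

pool = ThreadPool(20)  # at most 20 requests in flight at a time
for response in pool.imap(fetch, urls_from_file('urls.txt')):
    print(response.status_code)
```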
One solution I've thought of is to queue up only so many URLs initially, then add logic to queue up more whenever a fetch operation completes (roughly the sketch below), but I'm hoping there's a cleaner solution.
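In case it helps clarify what I mean, here's a rough sketch of that idea with tornado coroutines. `worker`, `urls_from_file`, and `urls.txt` are again just my placeholders, and I haven't tested this beyond making sure the shape is right:

```python
from tornado import gen, ioloop
from tornado.httpclient import AsyncHTTPClient

CONCURRENCY = 10  # stays within AsyncHTTPClient's default max_clients

def urls_from_file(path):
    # Lazily yield URLs one line at a time instead of loading them all.
    with open(path) as f:
        for line in f:
            yield line.strip()

@gen.coroutine
def worker(client, url_iter):
    # Each worker pulls its next URL only after its previous fetch
    # finishes, so at most CONCURRENCY URLs are in flight at once.
    for url in url_iter:
        try:
            response = yield client.fetch(url)
            print(url, response.code)
        except Exception as e:
            print(url, e)

@gen.coroutine
def main():
    client = AsyncHTTPClient()
    url_iter = urls_from_file('urls.txt')
    # A fixed number of workers sharing one iterator; yielding the list
    # waits for all of them to finish.
    yield [worker(client, url_iter) for _ in range(CONCURRENCY)]

if __name__ == '__main__':
    ioloop.IOLoop.current().run_sync(main)
```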
Any help or recommendations would be appreciated!