I've been experimenting with tornado, twisted, gevent, and grequests to get the best performance for fetching 50k URLs.
The process I want to build (a rough sketch of the download side follows the list):
- parse all URLs into a set (to avoid duplicates)
- for each URL: check whether it already exists in a whitelist Redis DB (which contains millions of URLs)
- download the URLs using gevent or another async library
- insert the fetched content into a queue
- in parallel: listen to the queue with threads
- process the queue items (intensive regex work) using the threads
- save the output to a MySQL DB
- for each URL, update the whitelist Redis DB
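This is roughly what I have in mind for the download side. It's only a sketch: the Redis connection settings, the whitelist key name, the concurrency numbers, and the whitelist semantics (I'm guessing "already present" means "skip") are placeholders, not my real setup:

```python
from gevent import monkey
monkey.patch_all(thread=False)  # cooperative sockets/SSL, but keep real OS threads for the processing side

import queue
import requests
import redis
from gevent.pool import Pool

# placeholders: connection settings and key layout are made up
whitelist = redis.StrictRedis(host='localhost', port=6379, db=0)
results = queue.Queue(maxsize=1000)  # bounded, so the crawler can't outrun the consumers

def fetch(url):
    # skip URLs already recorded in the whitelist DB (guessing at the intended semantics)
    if whitelist.sismember('url_whitelist', url):
        return
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return
    # put() blocks while the queue is full -- crude backpressure instead of unbounded memory growth
    results.put((url, resp.text))

def crawl(urls):
    pool = Pool(200)  # cap on concurrent downloads
    for url in urls:
        pool.spawn(fetch, url)
    pool.join()
```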
I'm going to process millions of URLs a day. I started implementing this but ran into a few problems.
Firstly, populating a queue with the results from the async crawler consumes too much memory, and I need to address that; what would be a good practice here? Secondly, I'm having a hard time synchronizing the threading with the gevent crawler: how do I download asynchronously and process the items while the queue is still being populated with results?
In other words, how do I synchronize the async crawler with the threaded code that processes its responses?
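For reference, this is the rough shape of the consumer side I'm trying to get working: real OS threads pulling from the same bounded queue as above, doing the regex work, and eventually writing to MySQL and updating Redis (both stubbed out here). The regex, the worker count, and the shutdown sentinel are placeholder choices on my part, not something I've settled on:

```python
import re
import threading

SENTINEL = object()  # pushed once per worker when crawling is finished

def worker():
    while True:
        item = results.get()
        if item is SENTINEL:
            break
        url, body = item
        matches = re.findall(r'href="([^"]+)"', body)  # placeholder for the real, heavier regex work
        # TODO: save `matches` to MySQL and add `url` to the whitelist Redis set here

# start the consumers before the crawler so the queue gets drained while downloads run
workers = [threading.Thread(target=worker) for _ in range(8)]
for t in workers:
    t.start()

crawl(urls)                 # the gevent pool from the sketch above; `urls` is the deduplicated set
for _ in workers:
    results.put(SENTINEL)   # one sentinel per worker so they all exit
for t in workers:
    t.join()
```

Is a bounded queue like this a reasonable way to cap memory, or is there a better pattern for bridging the gevent side and the threaded side?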
Thanks!