I've been experimenting with tornado, twisted, gevent, and grequests to get the best performance for fetching 50k URLs.
The process I want to build (a rough sketch of the download side follows the list):
- parse all URLs into a set (to avoid duplicates)
- for each URL: check whether it already exists in a whitelist Redis DB (which contains millions of URLs)
- download the URLs using gevent or another async library
- insert the fetched content into a queue
- in parallel: listen to the queue with threads
- process the queue items (intensive regex work) using the threads
- save the output to a MySQL DB
- for each URL, update the whitelist Redis DB
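This is roughly what I have in mind for the download side. It's only a sketch: the Redis connection settings, the whitelist key name, the concurrency numbers, and the whitelist semantics (I'm guessing "already present" means "skip") are placeholders, not my real setup:

```python
from gevent import monkey
monkey.patch_all(thread=False)  # cooperative sockets/SSL, but keep real OS threads for the processing side

import queue
import requests
import redis
from gevent.pool import Pool

# placeholders: connection settings and key layout are made up
whitelist = redis.StrictRedis(host='localhost', port=6379, db=0)
results = queue.Queue(maxsize=1000)  # bounded, so the crawler can't outrun the consumers

def fetch(url):
    # skip URLs already recorded in the whitelist DB (guessing at the intended semantics)
    if whitelist.sismember('url_whitelist', url):
        return
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return
    # put() blocks while the queue is full -- crude backpressure instead of unbounded memory growth
    results.put((url, resp.text))

def crawl(urls):
    pool = Pool(200)  # cap on concurrent downloads
    for url in urls:
        pool.spawn(fetch, url)
    pool.join()
```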
I'm going to process millions of URLs a day. I started implementing this but ran into a few problems.
Firstly, populating a queue with the results from the async crawler consumes too much memory, and I need to address that; what would be a good practice here? Secondly, I'm having a hard time synchronizing the threading with the gevent crawler: how do I download asynchronously and process the items while the queue is still being populated with results?
In other words, how do I synchronize the async crawler with the threaded code that processes its responses?
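For reference, this is the rough shape of the consumer side I'm trying to get working: real OS threads pulling from the same bounded queue as above, doing the regex work, and eventually writing to MySQL and updating Redis (both stubbed out here). The regex, the worker count, and the shutdown sentinel are placeholder choices on my part, not something I've settled on:

```python
import re
import threading

SENTINEL = object()  # pushed once per worker when crawling is finished

def worker():
    while True:
        item = results.get()
        if item is SENTINEL:
            break
        url, body = item
        matches = re.findall(r'href="([^"]+)"', body)  # placeholder for the real, heavier regex work
        # TODO: save `matches` to MySQL and add `url` to the whitelist Redis set here

# start the consumers before the crawler so the queue gets drained while downloads run
workers = [threading.Thread(target=worker) for _ in range(8)]
for t in workers:
    t.start()

crawl(urls)                 # the gevent pool from the sketch above; `urls` is the deduplicated set
for _ in workers:
    results.put(SENTINEL)   # one sentinel per worker so they all exit
for t in workers:
    t.join()
```

Is a bounded queue like this a reasonable way to cap memory, or is there a better pattern for bridging the gevent side and the threaded side?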
Thanks!