I'm currently using BeautifulSoup to scrape sourceforge.net for various project information, following the solution in this thread. It works well, but I would like it to go faster. Right now I build a list of 15 URLs, all sourceforge.net links, and feed them into run_parallel_in_threads. I'm getting about 2.5 pages per second, and increasing or decreasing the number of URLs in the list doesn't seem to have much effect on that rate. Is there any strategy to increase the number of pages I can scrape per second? Are there other solutions that are more suitable for this kind of project?
- Where does the bottleneck lie? If it is CPU then you can use `multiprocessing` instead of `threading` (assuming you have multiple cores). If it is bandwidth then there is not much you can do (from a software point of view). – freakish Aug 06 '14 at 05:55
- Thanks for the suggestion! I actually just assumed that my bottleneck was my bandwidth. I did some tests and it seems that my CPU is actually the bottleneck. I'll look into multiprocessing. – jshen Aug 06 '14 at 06:09
- The bottleneck is almost surely network IO, and yes, threads will help a lot. I don't have any Python solution though, sorry. – pguardiario Aug 08 '14 at 09:12
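If profiling does show that CPU-bound parsing is the limit, as the first comment suggests checking, one option is a `multiprocessing.Pool` that spreads the BeautifulSoup work across cores. A minimal sketch, assuming Python 3 and BeautifulSoup 4; the URL list and the title extraction are placeholders, not taken from the original thread:

```python
import multiprocessing
import urllib.request
from bs4 import BeautifulSoup

# Placeholder URLs; substitute the real sourceforge.net project pages.
URLS = ["https://sourceforge.net/projects/example%d/" % i for i in range(15)]

def fetch(url):
    # Network-bound: fetching can stay in the main process (or in threads).
    return urllib.request.urlopen(url).read()

def parse(html):
    # CPU-bound: this is the work worth spreading across cores.
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

if __name__ == "__main__":
    pages = [fetch(u) for u in URLS]
    with multiprocessing.Pool() as pool:   # one worker process per core by default
        titles = pool.map(parse, pages)
    print(titles)
```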
1 Answer
You could have the threads that run in parallel simply retrieve the web content. Once an HTML page is retrieved, pass it into a queue served by multiple workers, each of which parses a single page. You've now essentially pipelined your workflow: instead of each thread doing every step (retrieve page, scrape, store), each parallel thread only retrieves a page and then hands the parsing work off to the queue, whose workers process the tasks in a round-robin fashion.
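A minimal sketch of that pipeline, assuming Python 3 and BeautifulSoup 4; the URL list, thread counts, and title extraction are illustrative placeholders, not from the original thread:

```python
import queue
import threading
import urllib.request
from bs4 import BeautifulSoup

URLS = ["https://sourceforge.net/projects/example%d/" % i for i in range(15)]  # placeholders
NUM_FETCHERS = 8   # network-bound stage: more threads generally help
NUM_PARSERS = 4    # CPU-bound stage: roughly one per core

url_q = queue.Queue()
html_q = queue.Queue()
results = []

def fetcher():
    while True:
        url = url_q.get()
        if url is None:          # sentinel: no more URLs
            break
        html_q.put(urllib.request.urlopen(url).read())

def parser():
    while True:
        html = html_q.get()
        if html is None:         # sentinel: no more pages
            break
        soup = BeautifulSoup(html, "html.parser")
        results.append(soup.title.string if soup.title else None)

fetch_threads = [threading.Thread(target=fetcher) for _ in range(NUM_FETCHERS)]
parse_threads = [threading.Thread(target=parser) for _ in range(NUM_PARSERS)]
for t in fetch_threads + parse_threads:
    t.start()

for url in URLS:
    url_q.put(url)
for _ in fetch_threads:
    url_q.put(None)              # stop fetchers once the URL queue drains
for t in fetch_threads:
    t.join()
for _ in parse_threads:
    html_q.put(None)             # then stop parsers
for t in parse_threads:
    t.join()

print(results)
```

Because of the GIL, the parser threads won't parse on multiple cores at once; if parsing stays the bottleneck, the same pipeline structure works with parser processes instead of threads.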
Please let me know if you have any questions!

Devarsh Desai