I have been trying to scrape some data from a website using Python's mechanize and BeautifulSoup (bs4) packages. But since the number of pages I have to parse is around 100,000 or more, doing it with a single thread doesn't make sense. I tried Python's eventlet package to add some concurrency, but it didn't yield any improvement. Can anyone suggest something else I can do, or should do, to speed up the data acquisition?
-
Tried the multiprocessing module? – Aug 01 '12 at 08:26
-
try [Scrapy](http://scrapy.org/) for scraping web pages – warvariuc Aug 01 '12 at 09:26
-
warwaruk, I read somewhere that Scrapy doesn't let you fill forms. I ditched it for that reason. – user1343318 Aug 01 '12 at 21:28
1 Answer
I am going to quote my own answer to this question since it fits perfectly here as well:
For your specific task I would recommend a multiprocessing worker pool. You simply define a pool, tell it how many processes you want to use (one per processor core by default), and give it a function to run on each unit of work. Then you put every unit of work (in your case, a list of URLs) in a list and hand it to the worker pool.
Your output will be a list of the return values of your worker function, one for each item in your original list. All the cool multiprocessing goodness happens in the background. There are, of course, other ways of working with a worker pool, but this is my favourite one.
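A minimal sketch of that pattern (the URL list is made up, and the `scrape` body is a placeholder for your real mechanize/BeautifulSoup code, so the sketch runs offline):

```python
from multiprocessing import Pool

def scrape(url):
    # Placeholder for the real work: open the page with mechanize,
    # parse it with BeautifulSoup, and return the extracted data.
    return len(url)  # stand-in result so the sketch runs without a network

if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(1, 9)]
    pool = Pool(processes=4)          # omit processes= to get one per core
    results = pool.map(scrape, urls)  # blocks until every URL is processed
    pool.close()
    pool.join()
    print(results)  # one return value per URL, in the original order
```

Tune the process count to your machine; for network-bound scraping you can often go well above the core count.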
Happy multi-processing!
-
Thank you, Codemonkey. It is giving me a significant improvement. One question, though: are the processes in the worker pool synchronized among themselves? I mean, I have to fill a form, submit it, wait for the result, and then put those results into a file. The form has a field that grows linearly, i.e., 1, 2, 3, 4, ... n. I would like my program to retrieve the results however it wants, but when writing them to the file, it must keep them in chronological order. Any help there? – user1343318 Aug 01 '12 at 21:31
-
Yes, you receive the output of your worker pool's map in the original input order. For instance, if your data is `[1,2,3,4,5]` and your work function is `def work(x): return x*2`, then your output will be `[2,4,6,8,10]`, and so on. – Hubro Aug 01 '12 at 21:56
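That ordering guarantee can be checked with a short script: `pool.map` returns results in input order regardless of which worker finishes first, so writing them to a file in order needs no extra synchronization.

```python
from multiprocessing import Pool

def work(x):
    return x * 2

if __name__ == "__main__":
    pool = Pool(processes=4)
    # Results always match the order of the input list,
    # even though the workers run concurrently.
    print(pool.map(work, [1, 2, 3, 4, 5]))  # -> [2, 4, 6, 8, 10]
    pool.close()
    pool.join()
```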