1

I have been trying to access some data from a website. I have been using Python's mechanize, and beautifulsoup4 packages for this purpose. But since the amount of pages that I have to parse is around 100,000 and more, doing it single with a single thread doesn't make sense. I tried python's EventLet package to have some concurrency, but it didn't yield any improvement. Can anyone suggest something else that I can do, or should do to speed up the data acquisition process?

Anirudh Ramanathan
  • 46,179
  • 22
  • 132
  • 191
user1343318
  • 2,093
  • 6
  • 32
  • 59

1 Answers1

0

I am going to quote my own answer to this question since it fits perfectly here as well:


For your specific task I would recommend a multiprocessing worker pool. You simply define a pool and tell it how many processes you want to use (one per processor core by default) as well as a function you want to run on each unit of work. Then you ready every unit of work (in your case this would be a list of URLs) in a list and give it to the worker pool.

Your output will be a list of the return values of your worker function for every item of work in your original array. All the cool multi-processing goodness will happen in the background. There is of course other ways of working with the worker pool as well, but this is my favourite one.

Happy multi-processing!

Community
  • 1
  • 1
Hubro
  • 56,214
  • 69
  • 228
  • 381
  • Thank you, Codemonkey. It is giving me significant improvement. One question though, are the processes in worker pool synchronized themselves? I mean, I have to fill a form, submit it, wait for result, and then I put those results into a file. The form has a field which grows linearly, i.e., 1,2,3,4,... n. I will like my program to retrieve it whatever way it does, but when pasting the results in the file, it must do it in a chronological order. Any help there? – user1343318 Aug 01 '12 at 21:31
  • Yes, you receive the output of your worker pool map in the original order. For instance, if your data looks like `[1,2,3,4,5]` and your work function looks like this `def work(x): return x*2` then your output would be `[2,4,6,8,10]` and so on. – Hubro Aug 01 '12 at 21:56