
I have an architecture which is basically a queue of URL addresses and some classes that process the content at those URLs. At the moment the code works well, but it is slow to sequentially pull a URL out of the queue, send it to the corresponding class, download the URL's content, and finally process it.
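For reference, here is a minimal sketch of that sequential loop (queue.Queue, requests, and the process() stand-in are illustrative, not my actual classes):

    import queue

    import requests

    def process(url, text):
        # Stand-in for the classes that filter the downloaded content.
        print(url, len(text))

    def run_sequential(url_queue):
        # Each iteration blocks on the download before the next URL is pulled.
        while not url_queue.empty():
            url = url_queue.get()
            text = requests.get(url).text   # slow, network-bound step
            process(url, text)

    url_queue = queue.Queue()
    for url in ("http://example.com", "http://example.org"):
        url_queue.put(url)
    run_sequential(url_queue)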

It would be faster, and make better use of resources, if it could, for example, read n URLs out of the queue and then spawn n processes or threads to handle the downloading and processing.

I would appreciate your help with the following:

  1. What packages could be used to solve this problem?
  2. What other approaches can you think of?
PepperoniPizza
  • I recently posted samples for parallel processing: http://stackoverflow.com/a/17963614/2101808. If it's useful I can repost it here, with comments on processing web content. – eri Jul 31 '13 at 20:50
  • See http://stackoverflow.com/questions/3842237/parallel-processing-in-python?rq=1 for a multiprocessing/Pool solution and http://stackoverflow.com/questions/8241099/executing-tasks-in-parallel-in-python?rq=1 for a threading/Thread solution. There are many, many other ways to do this as well, using various frameworks. – roippi Jul 31 '13 at 20:50
  • What part of your code is slowest? – eri Jul 31 '13 at 20:52
  • @eri The slowest part is downloading the url content. – PepperoniPizza Jul 31 '13 at 20:53
  • @PepperoniPizza Which downloader is used? – eri Jul 31 '13 at 21:06
  • What do you do with the data after processing? – eri Jul 31 '13 at 21:27
  • @eri Well, I use requests to download the content of the URLs, pretty much like reading the HTML from a site. Data processing involves passing the text through several filters and regular expressions. So it would be cool to have isolated processes or threads getting the text (downloading) and passing it through the filters. – PepperoniPizza Jul 31 '13 at 22:09

2 Answers


You might want to look into the Python multiprocessing library. With multiprocessing.Pool, you can give it a function and a list of values, and it will call the function on each value in parallel, using as many or as few processes as you specify.
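For example, a minimal sketch of that idea (fetch_and_process and the urls list are placeholders for your own downloading and processing code):

    from multiprocessing import Pool

    import requests

    def fetch_and_process(url):
        # Download the URL and return something small to send back
        # to the parent process.
        text = requests.get(url).text
        return url, len(text)

    if __name__ == "__main__":
        urls = ["http://example.com", "http://example.org"]
        pool = Pool(processes=4)   # as many or as few processes as you specify
        for url, size in pool.map(fetch_and_process, urls):
            print(url, size)
        pool.close()
        pool.join()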

sdamashek

If the slow parts are C calls, such as downloading, database requests, and other I/O, you can simply use threading.Thread, since those calls release the GIL while they block.
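A rough sketch of that approach (download_worker and the shared queue are illustrative names, assuming the requests downloader you mention in the comments):

    import queue
    import threading

    import requests

    def download_worker(url_queue, results):
        # Blocking network calls release the GIL, so several downloads
        # run concurrently even with plain threads.
        while True:
            try:
                url = url_queue.get_nowait()
            except queue.Empty:
                return
            results.append((url, requests.get(url).text))

    url_queue = queue.Queue()
    for url in ("http://example.com", "http://example.org"):
        url_queue.put(url)

    results = []
    threads = [threading.Thread(target=download_worker, args=(url_queue, results))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()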

If the Python code itself is slow (frameworks, your own logic, non-accelerated parsers), you need multiprocessing.Pool or Process. It speeds up pure Python code by running it in separate processes, but sharing state between processes is trickier than with threads and needs a deep understanding of how it works in complex code (locks, semaphores).
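And a sketch of the Pool variant for the CPU-bound part (apply_filters is a placeholder for the regular-expression filters described in the comments):

    import re
    from multiprocessing import Pool

    def apply_filters(text):
        # Placeholder for the heavy filter/regex logic; each call runs
        # in a separate process, so it is not limited by the GIL.
        return len(re.findall(r"\w+", text))

    if __name__ == "__main__":
        pages = ["some downloaded page", "another downloaded page"]
        pool = Pool()   # defaults to one process per CPU core
        print(pool.map(apply_filters, pages))
        pool.close()
        pool.join()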

eri
  • Thank you, I will give it a look. I have fair experience with threading in C++, so semaphores, locks, and joins are not a problem. – PepperoniPizza Jul 31 '13 at 22:06