
I want to use subprocesses to let 20 instances of a written script run in parallel. Let's say I have a big list of URLs with like 100,000 entries, and my program should ensure that 20 instances of my script are working on that list at all times. I wanted to code it as follows:

urllist = [url1, url2, url3, .. , url100000]
i = 0
while number_of_subprocesses < 20 and i < 100000:
    subprocess.Popen(['python', 'script.py', urllist[i]])
    i = i + 1

My script just writes something into a database or a text file. It doesn't output anything and doesn't need more input than the URL.

My problem is I wasn't able to find out how to get the number of subprocesses that are active. I'm a novice programmer, so every hint and suggestion is welcome. I was also wondering how I can manage it, once the 20 subprocesses are loaded, that the while loop checks the conditions again? I thought of maybe putting another while loop over it, something like

while i < 100000:
    while number_of_subprocesses < 20:
        subprocess.Popen(['python', 'script.py', urllist[i]])
        i = i + 1
        if number_of_subprocesses == 20:
            sleep()  # wait some time until checking again

Or maybe there's a better way for the while loop to keep checking the number of subprocesses?

I also considered using the multiprocessing module, but I found it really convenient to just call script.py with subprocess instead of a function with multiprocessing.

Maybe someone can help me and lead me in the right direction. Thanks a lot!

zwieback86
  • related: [Limiting number of processes in multiprocessing python](http://stackoverflow.com/q/23236190/4279) – jfs May 31 '15 at 16:31

3 Answers


Taking a different approach from the above - as it seems that the callback can't be sent as a parameter:

import subprocess
import time

NextURLNo = 0
MaxProcesses = 20
MaxUrls = 100000  # Note this would be better to be len(urllist)
Processes = []

def StartNew():
    """ Start a new subprocess if there is work to do """
    global NextURLNo
    global Processes

    if NextURLNo < MaxUrls:
        proc = subprocess.Popen(['python', 'script.py', urllist[NextURLNo]])
        print("Started to process %s" % urllist[NextURLNo])
        NextURLNo += 1
        Processes.append(proc)

def CheckRunning():
    """ Check any running processes and start new ones if there are spare slots."""
    global Processes
    global NextURLNo

    for p in range(len(Processes) - 1, -1, -1):  # Check the processes in reverse order
        if Processes[p].poll() is not None:  # poll() returns None while the process is still running
            del Processes[p]  # Remove from list - this is why we needed reverse order

    while (len(Processes) < MaxProcesses) and (NextURLNo < MaxUrls):  # More to do and some spare slots
        StartNew()

if __name__ == "__main__":
    CheckRunning()  # This will start the max processes running
    while len(Processes) > 0:  # Something still going on.
        time.sleep(0.1)  # You may wish to change the time for this
        CheckRunning()

    print("Done!")
Steve Barnes
  • Thanks again for ur answer! Some little questions. What does len(Processes):0:-1 do and why its not sufficient to write for p in Processes: ? And i guess in the main procedure, it should be: while (len(Processes) > 0): ? Thanks for the effort again, thats a very creative answer! – zwieback86 Aug 08 '13 at 16:21
  • `[len(Processes):0:-1]` will run through the list of processes if any in reverse order so as to avoid problems that come from deleting from a list while you are traversing it. I'll correct the other. – Steve Barnes Aug 08 '13 at 16:24
  • There is still a little problem with Processes[p].poll() it says that "list indices must be integers, not popen". So is it possible to get the index of our p in the Processes list? Maybe something like Processes.index(p)? – zwieback86 Aug 08 '13 at 17:01
  • 1
    If i extend the if Processes[p].poll() with == 0, everything works fine! Thank you very much again! – zwieback86 Aug 08 '13 at 20:45
  • This is great! I also had to change processes[p].poll() to processes[p].poll() == 0 and for the reverse range I just used reversed(range(len(processes))). Thank you! – mjd2 May 29 '15 at 17:25
  • Warning: Some processes will return values other than zero if they finished but had a problem, (some actually return non-zero for success as well), so we need to check for the poll returning something other than `None` to detect if they have finished. – Steve Barnes May 29 '15 at 20:19
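As the comments above work out, `Popen.poll()` returns `None` while the child is still running, and the exit code (which may be non-zero on failure) once it has finished, so `is not None` is the right completeness test. A minimal sketch of that behaviour (`sys.executable` stands in for the `'python'` used elsewhere in this thread):

```python
import subprocess
import sys

# Spawn a short-lived child process.
proc = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(0.2)'])

print(proc.poll())   # None: the child is still running

proc.wait()          # block until the child exits
print(proc.poll())   # 0: the child's exit code, now that it has finished
```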

Just keep count as you start them and use a callback from each subprocess to start a new one if there are any url list entries to process.

e.g. Assuming that your sub-process calls the OnExit method passed to it as it ends:

import subprocess
import time

NextURLNo = 0
MaxProcesses = 20
NoSubProcess = 0
MaxUrls = 100000

def StartNew():
    """ Start a new subprocess if there is work to do """
    global NextURLNo
    global NoSubProcess

    if NextURLNo < MaxUrls:
        subprocess.Popen(['python', 'script.py', urllist[NextURLNo], OnExit])
        print("Started to process", urllist[NextURLNo])
        NextURLNo += 1
        NoSubProcess += 1

def OnExit():
    global NoSubProcess
    NoSubProcess -= 1

if __name__ == "__main__":
    for n in range(MaxProcesses):
        StartNew()
    while NoSubProcess > 0:
        time.sleep(1)
        if NextURLNo < MaxUrls:
            for n in range(NoSubProcess, MaxProcesses):
                StartNew()
Steve Barnes
  • Wow thank you very much, for that complete solution for my question. that is helping me alot! The callback idea is really nice and handy. Thanks for your effort! – zwieback86 Aug 08 '13 at 11:02
  • A little question again: I tried out your code but somehow its giving me a Typerror: argument of type 'function' is not iterable, in the subprocess.Popen line. I know it has something to do with calling the OnExit function, but what does it mean and how can i prevent this? Maybe for information im using Python3.3 on Windows7. – zwieback86 Aug 08 '13 at 11:54
  • Oh i think i got it, i have to implement to run the passed OnExit in my script.py. Right? – zwieback86 Aug 08 '13 at 11:56
  • It seems that its not possible to pass a function in the argument list, or am i wrong? Is there some kind of workaround? – zwieback86 Aug 08 '13 at 14:50
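As the comments conclude, this is indeed not possible: every element of Popen's argument list must be a string (or path-like object), since it becomes part of the child's command line, so a Python function object like `OnExit` can't be passed that way. A small sketch of the failure:

```python
import subprocess
import sys

def OnExit():
    pass

# Command-line arguments must be strings; a function object raises TypeError
# (the exact message varies by platform and Python version).
try:
    subprocess.Popen([sys.executable, '-c', 'pass', OnExit])
except TypeError as err:
    print('TypeError:', err)
```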

To keep a constant number of concurrent requests, you could use a thread pool:

#!/usr/bin/env python
from multiprocessing.dummy import Pool

def process_url(url):
    # ... handle a single url
    pass

urllist = [url1, url2, url3, .. , url100000]
for _ in Pool(20).imap_unordered(process_url, urllist):
    pass

To run processes instead of threads, remove .dummy from the import.
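For illustration, here is a self-contained version of the same approach with a placeholder `process_url` and a stand-in URL list (a real implementation would fetch each URL and write to the database or text file):

```python
from multiprocessing.dummy import Pool  # thread pool; drop .dummy for processes

def process_url(url):
    # Placeholder: a real version would fetch the url and record the result.
    return len(url)

urllist = ['http://example.com/%d' % i for i in range(100)]  # stand-in list

pool = Pool(20)  # at most 20 urls are handled concurrently
for _ in pool.imap_unordered(process_url, urllist):
    pass  # results are ignored here; the workers do the side effects
pool.close()
pool.join()
```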

jfs