
I am using the Python `multiprocessing` library to execute a Selenium script. My code is below:

#-- start and join multiple processes ---
from multiprocessing import Process

process_list = []
total_processes = 10  #-- no. of parallel processes
for i in range(total_processes):
    p = Process(target=get_browser_and_start, args=(url, nlp, pixel))
    process_list.append(p)
    print "starting process..."
    p.start()

for p in process_list:
    print "joining existing process..."
    p.join()

As I understand the join() function, it waits for each process to complete. But I want that, as soon as a process finishes, it is assigned a new task to perform.

It can be understood like this:

Say 8 processes are started at first.

no_of_tasks_to_perform = 100

for i in range(no_of_tasks_to_perform):
    start 8 processes
    if process no. 2 finishes executing, start a new process
    maintain 8 processes at any point in time until
    "i" reaches no_of_tasks_to_perform
Right leg

1 Answer


Instead of starting new processes every now and then, put all your tasks into a multiprocessing.Queue() and start 8 long-running processes. In each process, keep pulling tasks from the queue and doing the work until there are no tasks left.

In your case, it's more like this:

from multiprocessing import Queue, Process
from queue import Empty

def worker(queue):
    while True:
        try:
            # non-blocking get: checking empty() and then calling a blocking
            # get() is not atomic, so a worker could otherwise lose the last
            # task to a sibling and hang in get() forever
            task = queue.get_nowait()
        except Empty:
            break

        # now start to work on your task
        url, nlp, pixel = task[:3]   # unpack the task
        get_browser_and_start(url, nlp, pixel)

def main():
    queue = Queue()

    # Now put tasks into the queue
    no_of_tasks_to_perform = 100

    for i in range(no_of_tasks_to_perform):
        queue.put([url, nlp, pixel, ...])

    # Now start all 8 worker processes
    processes = [Process(target=worker, args=(queue,)) for _ in range(8)]
    for process in processes:
        process.start()
    for process in processes:
        process.join()

if __name__ == '__main__':
    main()
Shane
  • @shane, where are the 8 processes in this setup? Should it have been simply `process.start(8)`? I have a custom Python module I source, in which I can initialize a class to establish the webDriver instance and then call my scraping function with the parameters in the queue. But wouldn't I need to instantiate 8 different webDrivers into a pool? I am wondering how one X-window frame buffer (Xvfb) and headless chromedriver instance can act as 8 different processes to execute a queue of tasks (in the thousands). – Ricalsin Jan 27 '17 at 07:09
  • In this setup, you actually manually start 8 processes (or however many you want), and make each one a long-running process that continuously fetches new tasks (instantiating a browser and doing the work, in your case), e.g. `process1 = Process(target=worker, args=(queue, ))` ... `process8 ...`. It's a different setup if you want to use `multiprocessing.Pool`: you need to pass your function with `map`, but that's actually not as convenient in your case, especially when it comes to multiple parameters; check this out: http://stackoverflow.com/a/5442981/7405394 – Shane Jan 27 '17 at 07:47
  • @shane, hmm... I missed the **bold text** in your opening sentence; sorry/thanks. Another option could be to [instantiate the webDriver inside the worker of the queue](http://stackoverflow.com/questions/39824273/multiprocessing-and-selenium-python#answer-39843502) and then roughly control the number of processes launched by timing the execution of the queue, allowing processes to die off while new ones come on. This actually helps me better control the number of requests made per hour through a proxy switcher, whereas the number of processes can vary due to the I/O and waits. (???) – Ricalsin Jan 27 '17 at 16:49