
I'm running Python 2.7 on the GCE platform to do calculations. The GCE instances boot, install various packages, copy 80 GB of data from a storage bucket and run a "workermaster.py" script with nohup. The workermaster runs in an infinite loop which checks a task-queue bucket for tasks. When the task bucket isn't empty it picks a random file (task) and passes work to a calculation module. If there is nothing to do, the workermaster sleeps for a number of seconds and checks the task list again. The workermaster runs continuously until the instance is terminated (or something breaks!).

Currently this works quite well, but my problem is that my code only runs on instances with a single CPU. If I want to scale up the calculations I have to create many identical single-CPU instances, and this means there is a large cost overhead for creating many 80 GB disks and transferring the data to them each time, even though any particular calculation only reads one small portion of the data. I want to make everything more efficient and cost-effective by making my workermaster capable of using multiple CPUs, but after reading many tutorials and other questions on SO I'm completely confused.

I thought I could just turn the important part of my workermaster code into a function, and then create a pool of processes that "call" it using the multiprocessing module. Once the workermaster loop is running on each CPU, the processes do not need to interact with each other or depend on each other in any way; they just happen to be running on the same instance. The workermaster prints out information about where it is in the calculation, and I'm also confused about how it will be possible to tell the "print" statements from each process apart (I sketch one idea after the list below), but I guess that's a few steps from where I am now! My problems/confusion are that:

1) My workermaster "def" doesn't return any value because it just starts an infinite loop, whereas every web example seems to have something in the format myresult = pool.map(.....); and 2) my workermaster "def" doesn't need any arguments/inputs - it just runs, whereas the examples of multiprocessing that I have seen on SO and in the Python docs seem to have iterables.
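To show the kind of change I have in mind (untested), I imagine giving workermaster a dummy argument that the calculation itself ignores, and using it to label the print output so the processes can be told apart:

import time

def workermaster(x):
    # x is not used by the calculation itself - it exists only so that
    # Pool.map() has something to pass in, and to label the print output
    while True:
        print 'CPU-' + str(x) + ': checking the task bucket...'
        # ...rest of the loop as before...
        time.sleep(30)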

In case it is important, the simplified version of the workermaster code is:

import os
import random
import subprocess
import time

import calcmodule             # my own calculation module
import cloudstoragefunctions  # my own storage-bucket helpers

# filepath definitions go here

def workermaster():

    while True:

        tasklist = cloudstoragefunctions.getbucketfiles('<my-task-queue-bucket>')

        if tasklist:

            tasknumber = random.randint(0, len(tasklist) - 1)  # a valid random index into tasklist
            assignedtask = tasklist[tasknumber]

            print 'Assigned task is now: ' + assignedtask

            subprocess.call('gsutil -q cp gs://<my-task-queue-bucket>/' + assignedtask + ' "' + taskfilepath + assignedtask + '"', shell=True)

            tasktype = assignedtask.split('#')[0]

            if tasktype == 'Calculation':
                currentcalcid = assignedtask.split('#')[1]
                currentfilenumber = assignedtask.split('#')[2].replace('part', '')
                currentstartfile = assignedtask.split('#')[3]
                currentendfile = assignedtask.split('#')[4].replace('.csv', '')

                calcmodule.docalc(currentcalcid, currentfilenumber, currentstartfile, currentendfile)

            elif tasktype == 'Analysis':
                pass  # set up and run analysis module, etc.

            print '   Operation completed!'

            os.remove(taskfilepath + assignedtask)

        else:

            print 'There are no tasks to be processed.  Going to sleep...'
            time.sleep(30)

I'm trying to "call" the function multiple times using the multiprocessing module. I think I need to use the "Pool" approach, so I've tried this:

import multiprocessing

if __name__ == "__main__":

    p = multiprocessing.Pool()
    pool_output = p.map(workermaster, [])

My understanding from the docs is that the __name__ line is there only as a workaround for doing multiprocessing on Windows (which I am using for development, but GCE is on Linux). The p = multiprocessing.Pool() line creates a pool of workers equal to the number of system CPUs, since no argument is specified. If the number of CPUs were 1 then I would expect the code to behave as it did before I attempted to use multiprocessing. The last line is the one that I don't understand. I thought that it was telling each of the processors in the pool that the "target" (thing to run) is workermaster. From the docs there appears to be a compulsory argument which is an iterable, but I don't really understand what this is in my case, as workermaster doesn't take any arguments. I've tried passing it an empty list, an empty string and empty brackets (a tuple?) and it doesn't do anything.
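For comparison, my understanding of the usual pattern is that map() makes one call to the function per element of the iterable, something like this, which presumably means my empty list produces zero calls:

import multiprocessing

def square(x):
    return x * x

if __name__ == '__main__':
    p = multiprocessing.Pool()
    print p.map(square, [1, 2, 3])   # one call per element: prints [1, 4, 9]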

Please would it be possible for someone to help me out? There are lots of discussions about using multiprocessing, and the threads Mulitprocess Pools with different functions and python code with mulitprocessing only spawns one process each time seem to be close to what I am doing, but both still have iterables as arguments. If there is anything critical that I have left out please advise and I will modify my post - thank you to anyone who can help!

Paul
  • Pool is useful if you want to run the same function with different arguments. If you want to run a function only once then use a normal `Process()`. If you want to run the same function 2 times then you can manually create 2 `Process()`. If you want to use `Pool()` to run it 2 times then add a list with 2 arguments (even if you don't need them), because that is the information `Pool()` uses to run it 2 times. But if you run the function 2 times working on the same folder then you may have a conflict - you will run the same task 2 times. – furas Dec 26 '19 at 17:41
  • You will need to redefine your function to take at least one argument (you can discard it) if you want to use Pool and map. https://stackoverflow.com/questions/27689834/can-i-use-map-imap-imap-unordered-with-functions-with-no-arguments – rajendra Dec 26 '19 at 17:43
  • Thank you @furas and @rajendra. I added an argument to the worker function so it is now `def workermaster(x):`. I also use `x` as a variable for telling the CPU threads apart, by modifying print statements to something like `print 'CPU-' + str(x) + ': Status is now....'` etc. One problem I have noticed with the pool.map approach is that I can't kill the process on my laptop now using CTRL+C. I have to close the command prompt and start a new one - is there any particular reason/fix for this? If someone would like to write their response as an answer I'd be very happy to accept it. – Paul Dec 27 '19 at 03:24
  • Google `python multiprocessing ctrl+c` gives me: [Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python](https://stackoverflow.com/questions/11312525/catch-ctrlc-sigint-and-exit-multiprocesses-gracefully-in-python) – furas Dec 27 '19 at 08:24

1 Answer


Pool() is useful if you want to run the same function with different arguments.

If you want to run a function only once then use a normal Process().
If you want to run the same function 2 times then you can manually create 2 Process() objects.

If you want to use Pool() to run a function 2 times then add a list with 2 arguments (even if you don't need the arguments), because that is the information Pool() needs to run it 2 times.

But if you run the function 2 times on the same folder then it may run the same task 2 times; if you run it 5 times then it may run the same task 5 times. I don't know if that is a problem for you.
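A minimal sketch of both versions (the function here just prints once instead of looping forever, and its argument is a dummy that only exists so Pool() has something to iterate over):

import multiprocessing

def workermaster(x):
    # x only tells the processes apart in the output
    print 'CPU-' + str(x) + ': running'

if __name__ == '__main__':
    # version 1: create 2 Process() manually
    processes = [multiprocessing.Process(target=workermaster, args=(i,))
                 for i in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    # version 2: Pool() with a 2-element list runs the function 2 times
    pool = multiprocessing.Pool(2)
    pool.map(workermaster, range(2))
    pool.close()
    pool.join()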


As for Ctrl+C, I found Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python on Stack Overflow, but I don't know if it resolves your problem.
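The idea from that question, as a sketch (not tested with your code): the workers ignore SIGINT so only the main process sees Ctrl+C, and the main process catches KeyboardInterrupt and terminates the pool. On Python 2 a plain map() can swallow Ctrl+C, so map_async() with a timeout on get() is the usual workaround:

import multiprocessing
import signal
import time

def init_worker():
    # worker processes ignore SIGINT so only the main process sees Ctrl+C
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def workermaster(x):
    while True:
        print 'CPU-' + str(x) + ': working...'
        time.sleep(5)

if __name__ == '__main__':
    pool = multiprocessing.Pool(2, initializer=init_worker)
    try:
        # get() with a (very long) timeout stays interruptible on Python 2
        pool.map_async(workermaster, range(2)).get(9999999)
    except KeyboardInterrupt:
        print 'Caught Ctrl+C, terminating workers'
        pool.terminate()
        pool.join()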

furas
    Thank you once again furas and apologies for not Googling the CTRL+C query - I shouldn't have been so lazy! I couldn't get this to work using the Process method but the pool and mapping one works well. The CTRL+C looks like it needs some reading and further work but hopefully I will have it working soon. Thank you! – Paul Dec 27 '19 at 14:17