
I have a Python script that takes a day to run. I need to run the same code for different parameters, but using a for loop to iterate over all of them is not viable: it would only multiply the total computation time. This simple example depicts the situation:

ValuesParams = [1, 2, 3, 4, 5]
for i in ValuesParams:
    do_something()
    # write the output to file_i.csv

I would like to run 5 different programs (on a computer cluster), with 5 different values, writing 5 differently named CSV files, all at the same time, because running that for loop in a single program would take 5 days. In reality it is not just 5 parameters, and the program is not a simple loop.

How can I do this? Any piece of advice or information on what to look for would be incredibly useful.

EDIT Thanks for all the replies, especially remram's; they put me on the path to the answer.

Solution 1: I created a script RunPrograms.sh that iterates over the parameters var1 and var2, passes them to my Python script test.py, and launches all of the jobs at the same time. One limitation persists: I have to be careful not to fire off, say, 1000 jobs at once, but otherwise this is what I needed (one way to cap the number of concurrent jobs is sketched after the Python snippet below).

#!/bin/bash
for var1 in $(seq 1 0.5 3); do
        for var2 in $(seq 1 3); do
                python3 test.py $var1 $var2 &
        done
done

and inside the Python script:

import sys

a = float(sys.argv[1])  # var1 from the shell script
b = float(sys.argv[2])  # var2 from the shell script

def Code(a, b):
    do_something()

Code(a, b)
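
One way to keep that limitation in check is to cap the number of concurrent jobs from within the same loop; a sketch, assuming bash >= 4.3 for wait -n (the cap of 8 is arbitrary):

#!/bin/bash
MAXJOBS=8   # arbitrary cap on concurrent jobs
for var1 in $(seq 1 0.5 3); do
        for var2 in $(seq 1 3); do
                while (( $(jobs -rp | wc -l) >= MAXJOBS )); do
                        wait -n   # block until any one background job exits
                done
                python3 test.py $var1 $var2 &
        done
done
wait   # let the remaining jobs finish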

Solution 2: Inside the Python script, use multiprocessing along with starmap to pass multiple parameters:

import itertools
import numpy as np
from multiprocessing import Pool

def Code(T, K):
    do_something()

RangeT = np.linspace(0.01, 0.05, num=10)
RangeK = [2, 3, 4]
Z = list(itertools.product(RangeT, RangeK))  # all (T, K) combinations

if __name__ == '__main__':
    pool = Pool(4)
    results = pool.starmap(Code, Z)  # unpacks each (T, K) pair into Code
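
Each worker can also write its own uniquely named CSV, as the original goal requires; a minimal sketch, where the filename pattern is hypothetical and do_something is assumed to return an iterable of rows:

import csv

def Code(T, K):
    rows = do_something(T, K)  # assumed to return an iterable of rows
    # hypothetical naming scheme: one file per (T, K) combination
    with open(f"out_T{T}_K{K}.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)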
Sebastian
  • A program that runs for a day? It probably has room for refactoring. Did you profile the program, and do you know what the bottlenecks are? – Walter A Aug 21 '20 at 19:40
  • Yes, it takes 1+ day because I am simulating protein levels in cells. Running one cell takes 1 sec, but running 100k takes much more time. The program has already been optimized using numba; however, it is still not enough. – Sebastian Aug 22 '20 at 16:24

3 Answers


You can either use ProcessPoolExecutor from within your Python code:

from concurrent.futures.process import ProcessPoolExecutor

ValuesParams = [1, 2, 3, 4, 5]

# the __main__ guard matters on platforms that spawn workers (Windows, macOS)
if __name__ == '__main__':
    with ProcessPoolExecutor(5) as pool:
        pool.map(do_something, ValuesParams)
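
If more than one parameter varies (as asked in the comments below), Executor.map also accepts several iterables and calls the function with one item taken from each; a minimal sketch, assuming a two-argument do_something(a, b):

import itertools
from concurrent.futures.process import ProcessPoolExecutor

# every (a, b) combination, unzipped into two parallel tuples
pairs = list(itertools.product([1, 2, 3, 4], [0.1, 0.2, 0.3]))
a_values, b_values = zip(*pairs)

if __name__ == '__main__':
    with ProcessPoolExecutor(5) as pool:
        pool.map(do_something, a_values, b_values)  # calls do_something(a, b)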

Or you can work at the shell level and simply run your program multiple times with different command-line arguments, for example using parallel:

parallel python myscript.py -- "1" "2" "3" "4" "5"

and, inside myscript.py:

import sys

i = int(sys.argv[1])
do_something(i)
# write the output to file_i.csv
remram
  • Thanks for your answer, but could you please elaborate on the shell solution, for example if instead of 5 parameters I need 1000? And can it be applied if I need to vary 2 or more parameters, a=[1,2,3,4 ..], b=[0.1,0.2,0.3 ..]? The pool solution doesn't work for me because I am also using numba to speed up the process, so when I run the pool it omits the flag. – Sebastian Aug 22 '20 at 16:22
  • The process pool should work with numba as well. With `parallel`, you can pass multiple arguments using `-n`, and run more jobs at a time using `-j`. – remram Aug 23 '20 at 17:09
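
A sketch of that suggestion, assuming the same parallel syntax as in the answer (the moreutils variant, where -- separates the command from its argument list): -n 2 passes the arguments two at a time, so each invocation receives one (a, b) pair.

# at most 4 jobs at once, two arguments per run of test.py
parallel -j 4 -n 2 python3 test.py -- 1 0.1  1 0.2  2 0.1  2 0.2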

You could use threads (note, though, that CPython's GIL means pure-Python, CPU-bound work will not actually run in parallel this way):

import threading

def something(arg):
    pass  # placeholder for the real work

ValuesParams = [1, 2, 3, 4, 5]
threads = []
for i in ValuesParams:
    thread = threading.Thread(target=something, args=(i,))
    thread.start()
    threads.append(thread)
for thread in threads:
    thread.join()  # wait for every thread to finish

Advay168
  • Thanks for your answer, I think this can work, but if the function `something` returns a `result` that I need to use later in the program, how can I access it? Where is it assigned? – Sebastian Aug 22 '20 at 17:11
  • You could have a global dictionary that each thread writes its return value to – Advay168 Aug 28 '20 at 07:15
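
A minimal sketch of that idea (the computation inside something is a placeholder); each thread writes to its own key, which is safe because the keys are distinct:

import threading

results = {}  # shared dict; each thread fills in its own key

def something(arg):
    results[arg] = arg * 2  # placeholder computation

threads = [threading.Thread(target=something, args=(i,)) for i in [1, 2, 3, 4, 5]]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # {1: 2, 2: 4, 3: 6, 4: 8, 5: 10}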

You can look at parallel processing with xargs.

A simple example is time head -12 <(yes "1") | xargs -n1 -P4 sleep, which runs twelve sleep 1 commands, four in parallel; the whole pipeline takes about 3 seconds.

First check how many processes you want running in parallel (I think all 5, but maybe you want to start with 3). In your case, when you want programs program1 .. program5 running with at most 3 in parallel, use:

printf "%s\n" {1..5} | xargs -I"{}" -P3 bash -c 'program{} > file_{}.csv'

When you want to test the above first, use:

echo "Start at $(date +%S)"
printf "%s\n" {1..5} |
    xargs -I"{}" -P3 bash -c 'sleep 2; echo "program {} finished at $(date +%S)" | tee file_{}.csv'
Walter A