
Suppose I have a single function, processing(). I want to run the same function multiple times with different parameters, in parallel instead of sequentially one after the other.

import rasterio

def processing(image_location):
    image = rasterio.open(image_location)
    ...
    ...
    return result

# Calling the function serially, one after the other, with different parameters and saving each result to a variable.
results1 = processing(r'/home/test/image_1.tif')
results2 = processing(r'/home/test/image_2.tif')
results3 = processing(r'/home/test/image_3.tif')

For example, if I run processing(r'/home/test/image_1.tif'), then processing(r'/home/test/image_2.tif'), and then processing(r'/home/test/image_3.tif'), as shown in the code above, the calls run sequentially one after the other; if one call takes 5 minutes, running all three takes 5 x 3 = 15 minutes. The task is embarrassingly parallel, so I am wondering whether I can run these three calls in parallel and have all three parameters processed in about 5 minutes.

What is the fastest way to do this? The script should, by default, be able to use all the available resources (CPU and RAM) for the task.

mArk

4 Answers


You can use a pool from the multiprocessing library to execute the function in parallel for all parameters and save the results to a results variable:

from multiprocessing.pool import ThreadPool

pool = ThreadPool()  # one worker thread per CPU core by default
images = [r'/home/test/image_1.tif', r'/home/test/image_2.tif', r'/home/test/image_3.tif']
results = pool.map(processing, images)
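
Note that ThreadPool runs the calls as threads inside a single process, so purely CPU-bound work is still limited by the GIL (it works well when most of the time is spent in I/O or in C code that releases the GIL). If processing() is CPU-bound, a process-based pool may scale better. A minimal sketch, assuming processing() is defined in a module the worker processes can import:

from multiprocessing import Pool

images = [r'/home/test/image_1.tif', r'/home/test/image_2.tif', r'/home/test/image_3.tif']

if __name__ == '__main__':
    with Pool() as pool:  # one worker process per CPU core by default
        results = pool.map(processing, images)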
Alderven

You might want to take a look at IPython Parallel. It allows you to easily run functions on a load-balanced (local) cluster.

For this little example, make sure you have IPython Parallel, NumPy and Pillow installed. To run the example, you first need to launch the cluster. To launch a local cluster with four parallel engines, type into a terminal (one engine per processor core seems a reasonable choice):

ipcluster start -n 4

Then you can run the following script, which searches for .jpg images in a given directory and counts the number of pixels in each image:

import ipyparallel as ipp


rc = ipp.Client()
with rc[:].sync_imports():  # import on all engines
    import numpy
    from pathlib import Path
    from PIL import Image


lview = rc.load_balanced_view()  # default load-balanced view
lview.block = True  # block until map() is finished


@lview.parallel()
def count_pixels(fn: Path):
    """Silly function to count the number of pixels in an image file"""
    im = Image.open(fn)
    xx = numpy.asarray(im)
    num_pixels = xx.shape[0] * xx.shape[1]
    return fn.stem, num_pixels


pic_dir = Path('Pictures')
fn_lst = pic_dir.glob('*.jpg')  # list all jpg-files in pic_dir

results = count_pixels.map(fn_lst)  # execute in parallel

for n_, cnt in results:
    print(f"'{n_}' has {cnt} pixels.")
Dietrich

Another way of doing this with the multiprocessing library (see @Alderven's answer for a slightly different approach).

import multiprocessing as mp
import numpy as np

def calculate(input_args):
    result = input_args * 2
    return result

N = mp.cpu_count()
parallel_input = np.arange(0, 100)
print('Number of CPUs:', N)
print('Number of iterations:', len(parallel_input))

with mp.Pool(processes=N) as p:
    results = p.map(calculate, list(parallel_input))

The results variable will contain a list with your processed data, which you can then write out.
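
Applied to the question, p.map preserves the order of the inputs, so the three results can be unpacked directly. A sketch, assuming the processing() function from the question is importable by the worker processes:

images = [r'/home/test/image_1.tif', r'/home/test/image_2.tif', r'/home/test/image_3.tif']
with mp.Pool(processes=N) as p:
    results1, results2, results3 = p.map(processing, images)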

zwep

I think one of the easiest methods is using joblib:

import joblib

allJobs = []
allJobs.append(joblib.delayed(processing)(r'/home/test/image_1.tif'))
allJobs.append(joblib.delayed(processing)(r'/home/test/image_2.tif'))
allJobs.append(joblib.delayed(processing)(r'/home/test/image_3.tif'))

results = joblib.Parallel(n_jobs=joblib.cpu_count(), verbose=10)(allJobs)
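
The same thing can be written more compactly with a generator expression; n_jobs=-1 is another way to ask joblib to use all available cores. A sketch, assuming the processing() function from the question:

import joblib

images = [r'/home/test/image_1.tif', r'/home/test/image_2.tif', r'/home/test/image_3.tif']
results = joblib.Parallel(n_jobs=-1, verbose=10)(
    joblib.delayed(processing)(image) for image in images
)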

Amir