I am not familiar with Python, but I would like to run a function that reads and writes multiple files in parallel. Here is a minimal example:
from multiprocessing import Pool

import pandas as pd


def multiple(input_path, output_path, n):
    # Read a CSV, multiply every value by n, and write the result out.
    df = pd.read_csv(input_path, index_col=0)
    new_df = df.multiply(n)
    new_df.to_csv(output_path)


workers = 6
input_filenames = [f'input_{i}.csv' for i in range(1, 11)]
output_filenames = [f'output_{i}.csv' for i in range(1, 11)]

with Pool(workers) as pool:
    pool.map(multiple, ...)
If I were using a for loop, I could do it like this:
for i, input_path in enumerate(input_filenames):
    output_path = output_filenames[i]
    multiple(input_path, output_path, 2)
How should I convert this into pool.map so that each input filename stays matched by index with its output filename, and so that all three arguments (input_path, output_path, n) are passed to the function?
Thank you!