
I have a Python function that takes a file as an argument, let's call it func(file). I also have a directory containing around 100 files, which should be passed to a Python script, execute_function.py; this script takes a list of files and applies func(file) to each of them. What is the simplest way to subdivide those 100 files into 4 stacks of 25 files and execute them in 4 different Python kernels with execute_function.py?

It's basically a brute force parallelization with bash.
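
Something like the sketch below is what I am picturing (untested; it assumes execute_function.py accepts the file paths as command-line arguments, and /path/to/dir stands in for my directory):

files=(/path/to/dir/*)            # collect the ~100 files into a bash array
n=$(( (${#files[@]} + 3) / 4 ))   # files per stack, rounded up (25 for 100 files)
for i in 0 1 2 3; do
    python execute_function.py "${files[@]:$((i * n)):$n}" &   # one background interpreter per stack
done
wait                              # wait for all 4 processes to finish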

– prm
  • Does this answer your question? [How do I split a list into equally-sized chunks?](https://stackoverflow.com/questions/312443/how-do-i-split-a-list-into-equally-sized-chunks) – jordanm Jul 06 '22 at 15:56
  • Not really, I was expecting to split the list of files into 4 lists using bash and then apply the python function 4 times using these lists as arguments. – prm Jul 06 '22 at 15:58
  • The `subprocess` library lets you run scripts / bash commands and capture the output, which you could then call your Python function on. – Erik B Jul 06 '22 at 16:42

2 Answers


Not a bash answer, per se, but you can use multiprocessing to achieve this. Here's an example where I split an array into 4 chunks and work on those 4 chunks concurrently in a standalone function.

import concurrent.futures
import time


def split_array(array, num_splits):
    """Yield num_splits contiguous chunks; the last chunk absorbs any remainder."""
    chunksize = len(array) // num_splits
    chunk = 0
    for i in range(num_splits):
        if i + 1 == num_splits:
            yield array[chunk:]
        else:
            yield array[chunk : chunk + chunksize]
        chunk += chunksize


def map_processes(func, iterable_):
    """Run func on each item of iterable_ in its own worker process."""
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Materialize the results before the executor shuts down.
        result = list(executor.map(func, iterable_))
    return result


def dummy_func(chunk):
    """Stand-in for the real work done on one chunk."""
    for item in chunk:
        print(item)
        time.sleep(0.5)


if __name__ == "__main__":
    array = list(range(100))
    split_arrays = split_array(array, 4)
    map_processes(dummy_func, split_arrays)
– irahorecka

Using GNU Parallel

parallel --jobs 4 python execute_function.py ::: files*

By default it will run one job per CPU core. This can be adjusted with --jobs.
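
The command above hands one file to each invocation of execute_function.py. If the script instead expects a batch of files per call, the arguments can be grouped with --max-args; a sketch (assuming execute_function.py accepts several file paths in one call):

parallel --jobs 4 --max-args 25 python execute_function.py ::: files*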

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

[Figure: simple scheduling]

GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:

[Figure: GNU Parallel scheduling]

Installation

For security reasons you should install GNU Parallel with your package manager, but if GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:

$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
   fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

– Ole Tange