
I have a script that takes as input a list of filenames and loops over them, generating one output file per input file, so this seems like a case that can be easily parallelized.

I have an 8-core machine.

I tried using a -parallel flag on this command:

python perfile_code.py list_of_files.txt

But I can't make it work. The specific question is: how do I use parallel in bash with a Python command on Linux, along with the arguments for the case described above?

There is a Linux parallel command (sudo apt-get install parallel), which I read somewhere can do this job, but I don't know how to use it.

Most internet resources explain how to do it in Python, but can it be done in bash?

Please help, thanks.

Based on an answer below, here is an example that is still not working; please suggest how to make it work.

I have a folder with 2 files; in this example I just want to create duplicates of them with different names, in parallel.

# "filelist" is a directory containing two files, a.txt and b.txt.
# a.txt is the first file, b.txt is the second file.
# I pass a .txt file listing both names to the main program.

from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path
import sys

def translate(filename):
    print(filename)
    # copy the input file line by line into "<filename>.x"
    with open(filename, "r") as f, open(filename + ".x", "w") as g:
        for line in f:
            g.write(line)

def main(path_to_file_with_list):
    futures = []
    with ProcessPoolExecutor(max_workers=8) as executor:
        for filename in Path(path_to_file_with_list).open():
            # strip the trailing newline and keep the future so it can be waited on
            futures.append(executor.submit(translate, "filelist/" + filename.strip()))
        for future in as_completed(futures):
            future.result()

if __name__ == "__main__":
    main(sys.argv[1])
Rafael
  • Why vote to close? This is a very specific question, asking how to use parallel in bash with Python, along with the arguments. I have edited the question to make it clearer; please reconsider. – Rafael Feb 04 '19 at 01:31
  • You're showing a fundamental lack of knowledge regarding the topic of parallelism. `-parallel` isn't a valid command-line option for Python. Programming for parallel operations usually requires proactive development strategies on the part of the programmer. I'd suggest googling "python parallel". – Ouroborus Feb 04 '19 at 01:50
  • @Ouroborus No, no, consider this: https://opensource.com/article/18/5/gnu-parallel. I want to run a Python program along with parallel, for a very specific case. If an arbitrary convert program can be piped to parallel, why wouldn't a Python program work? – Rafael Feb 04 '19 at 01:52
  • That still requires that you understand how parallelism works in general and that your software is capable of operating in that environment. As you describe it, your current Python script would not benefit from GNU `parallel`. Reading and understanding the article you linked would go a long way toward understanding what you need to do. – Ouroborus Feb 04 '19 at 01:55
  • There's no turnkey `--parallel` flag. You need to write the parallelism yourself; see: [multiprocessing](https://docs.python.org/3/library/multiprocessing.html?highlight=process) – stacksonstacks Feb 04 '19 at 02:48
  • Does your program work if you run it like this `cat list_of_files.txt | python perfile_code.py /dev/stdin` ? – Mark Setchell Feb 05 '19 at 12:50

3 Answers


Based on your comment,

@Ouroborus No, no, consider this: opensource.com/article/18/5/gnu-parallel. I want to run a Python program along with parallel, for a very specific case. If an arbitrary convert program can be piped to parallel, why wouldn't a Python program work?

I think this might help:

convert wasn't chosen arbitrarily. It was chosen because it is a well-known program that (roughly) maps a single input file, provided via the command line, to a single output file, also provided via the command line.

The typical shell for loop can be used to iterate over a list. In the article you linked, they show an example:

for i in *jpeg; do convert $i $i.png ; done

This (again, roughly) takes a list of file names and applies them, one by one, to a command template and then runs that command.

The issue here is that for would necessarily wait until a command is finished before running the next one and so may under-utilize today's multi-core processors.

parallel acts as a kind of replacement for for. It makes the assumption that a command can be executed multiple times simultaneously, each with different arguments, without the instances interfering with each other.

In the article, they show a command using parallel:

find . -name "*jpeg" | parallel -I% --max-args 1 convert % %.png

that is equivalent to the previous for command. The difference (still roughly) is that parallel runs several variants of the templated command simultaneously without necessarily waiting for each to complete.


For your specific situation, in order to be able to use parallel, you would need to:

  • Adjust your Python script so that it takes one input (such as a file name) and one output (also possibly a file name), both via the command line.
  • Figure out how to set up parallel so that it receives a list of those file names and inserts each one into a command template, running your Python script on each file individually (see the sketch after this list).
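For illustration, here is a minimal sketch of both steps. The two-argument convention and the .out suffix are assumptions made for this example, not something from the original question:

# perfile_code.py - hypothetical adjusted version that maps one input
# file (argv[1]) to one output file (argv[2])
import sys

def translate(in_name, out_name):
    # placeholder per-file work: this sketch just copies the file
    with open(in_name, "r") as f, open(out_name, "w") as g:
        for line in f:
            g.write(line)

if __name__ == "__main__":
    translate(sys.argv[1], sys.argv[2])

parallel could then drive it with one job per listed file:

parallel python perfile_code.py {} {}.out :::: list_of_files.txt

Here {} is parallel's default replacement string and :::: reads the arguments from a file, as in the last answer below.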
Ouroborus
  • Could you take a look at the working example I edited into the question above? – Rafael Feb 04 '19 at 03:14
  • @Rafael Aside from an obvious syntax error, it looks like it should do what you expect. Rudimentary testing shows it works. – Ouroborus Feb 04 '19 at 05:33

You can just use an ordinary shell for loop and append the & background indicator to the python command inside it:

for file in $(cat list_of_files.txt); do
    python perfile_code.py "$file" &
done

Of course, this assumes your Python code generates separate output files by itself.
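If the calling script needs to block until every background copy has finished, bash's built-in wait can be appended after the loop (a small extension of the loop above, not part of the original answer):

for file in $(cat list_of_files.txt); do
    python perfile_code.py "$file" &
done
wait   # returns once every backgrounded python process has exited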

It is just that simple, although it is not the usual approach: in general, people favor using Python itself to control the parallel execution of the loop, if you can edit the program. One nice way to do that is to use concurrent.futures in Python to create a worker pool with 8 workers; the shell approach above launches all instances in parallel at once.

Assuming your code has a translate function that takes a filename, your Python code could be written as:

from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

def translate(filename):
    ...

def main(path_to_file_with_list):
    futures = []
    with ProcessPoolExecutor(max_workers=8) as executor:
        for filename in Path(path_to_file_with_list).open():
            # keep the future so as_completed can wait on it; strip the newline
            futures.append(executor.submit(translate, filename.strip()))
        for future in as_completed(futures):
            future.result()

if __name__ == "__main__":
    import sys
    main(sys.argv[1])

This won't depend on special shell syntax, and it takes care of corner cases and number-of-workers handling, which can be hard to do properly from bash.

jsbueno
  • Thanks for a working example! I tried your example; please take a look at the edited question, it still doesn't work. – Rafael Feb 04 '19 at 03:09
  • Does the example you wrote work in both Python 3 and 2? – Rafael Feb 04 '19 at 03:39
  • 1
    The only Python3.5+ part there is the `pathlib.Path`. In Python 2.7, just use plain old `open(path_to_file_wiith_list)` instead, and no need to `from pathlib import Path`. `concurrent.futures` works the same in Python 2.7 and newer versions. – jsbueno Feb 04 '19 at 14:04
  • This risks overloading your machine. If `list_of_files.txt` contains 1000000 names, then it is likely your machine will slow to a crawl. – Ole Tange Feb 07 '19 at 13:14
  • The Python version creates a pool of workers. The queued future objects use a minimum of resources - even 1_000_000 names is peanuts on a machine with 4GB of main memory. Of course, 40_000_000 would start to be something - and if one has that big a file list, it is just a matter of taking care when creating the futures; it would be 6 or 7 extra LoC in the code above. The shell version, yes, launches all processes in parallel immediately - even a couple thousand filenames would overwhelm any machine. – jsbueno Feb 07 '19 at 13:44
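For reference, here is a minimal sketch of the throttling jsbueno alludes to (a hypothetical addition, not code from the answer): submit futures in bounded batches, draining each batch before queuing the next, so the number of pending futures stays small. It reuses the translate function from the answer above.

from concurrent.futures import ProcessPoolExecutor, as_completed

def main_batched(path_to_file_with_list, batch_size=10000):
    # assumes translate(filename) is defined as in the answer above
    with ProcessPoolExecutor(max_workers=8) as executor:
        batch = []
        for filename in open(path_to_file_with_list):
            batch.append(executor.submit(translate, filename.strip()))
            if len(batch) >= batch_size:
                # drain this batch before queuing more futures
                for future in as_completed(batch):
                    future.result()
                batch = []
        # drain whatever is left in the final partial batch
        for future in as_completed(batch):
            future.result()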

It is unclear from your question how you run your tasks in serial. But if we assume you run:

python perfile_code.py file1
python perfile_code.py file2
python perfile_code.py file3
:
python perfile_code.py fileN

then the simple way to parallelize this would be:

parallel python perfile_code.py ::: file*

If you have a list of files with one line per file, then use:

parallel python perfile_code.py :::: filelist.txt

It will run one job per CPU thread in parallel, so if filelist.txt contains 1000000 names, it will not run them all at the same time, but will only start a new job when another finishes.
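If you want to pin the number of simultaneous jobs explicitly, for example to the 8 cores mentioned in the question, GNU parallel's -j option sets the number of job slots:

parallel -j 8 python perfile_code.py :::: filelist.txt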

Ole Tange