
I run a very simple shell script that performs some transformations on files I download every day. Typically it is a zip archive containing six files, which I then process in five steps before inserting the content into a database. The first step takes 5-8 minutes per file and is CPU-bound.

I have two computers I perform this task on, one with two cores and one with four cores plus hyperthreading. Since the first step takes 30+ minutes in my current setup, I would like to multithread it.

The first step is basically:

for file in *.txt; do
        dosomething "$file" "$file.csv"
done

On my 2-core computer I would like to process two files in parallel; on my 8-thread machine I would like to process all six files in parallel (and it would be nice if a day when the archive contains nine files were handled gracefully too). All files must be processed before the next step starts (the next step is much faster).

How do I start a suitable number of threads/processes, and then hold off executing the next step until the previous step is completely finished?

d-b
  • Just execute what you want through `parallel`. – Daniel Kamil Kozar Apr 13 '18 at 16:16
  • 1
    You don't need threading. Just run multiple processes. – Charles Duffy Apr 13 '18 at 16:18
  • `for file in *.txt; do dosomething "$file" "$file.csv" & done` – Charles Duffy Apr 13 '18 at 16:18
  • 2
    @DanielKamilKozar, that's potentially overkill -- `parallel` requires a Perl interpreter to install; the basics of its functionality are present in GNU xargs' `-P` functionality, and in many cases neither of those are needed, as the shell itself can create and manage background processes. – Charles Duffy Apr 13 '18 at 16:23
  • 1
    @CharlesDuffy : Most systems I operate on already do have Perl installed, so I've never noticed that. Thus, `parallel` has simply grown as the go-to idea for such situations in my head. Thanks for the information, I'll keep that in mind. – Daniel Kamil Kozar Apr 13 '18 at 16:24
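
As a quick illustration of the `xargs -P` approach mentioned in the comments above, a minimal sketch (GNU xargs assumed; `-P4` is an arbitrary job limit, and `dosomething` is the question's placeholder):

# run up to 4 conversions at a time; {} is replaced by each file name
printf '%s\0' *.txt | xargs -0 -P4 -I{} dosomething {} {}.csv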

1 Answer


Shell scripts are not a great place for job distribution. Fundamentally, they just call a sequence of programs which may or may not use multiple cores themselves.

You can still achieve some degree of parallelism by running your jobs in the background (by placing `&` after your command). This allows your script to continue doing whatever it wants to do while a specific command continues to run in the background. Running the `wait` command afterwards forces your script to wait for all background jobs to complete before moving on.
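
For example, a minimal sketch of that pattern applied to the loop from the question:

for file in *.txt; do
        dosomething "$file" "$file.csv" &    # each conversion runs in the background
done
wait    # blocks until every background job has exited
# ...the next (fast) step can safely start here...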

You can also store the PIDs of individual commands in an array and wait on those specifically. See this answer for more details on how to do this properly.
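
A sketch of that variant (the `pids` array name is just illustrative):

pids=()
for file in *.txt; do
        dosomething "$file" "$file.csv" &
        pids+=("$!")    # $! holds the PID of the most recent background job
done
for pid in "${pids[@]}"; do
        wait "$pid"    # waiting per PID also surfaces each job's exit status in $?
done

Waiting per PID lets you notice which job failed, since `wait "$pid"` returns that job's exit status.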

For your use case, you could check the number of available cores and background/wait for that many processes at a time. You can check how many logical cores you have by grepping `/proc/cpuinfo`: `grep -c ^processor /proc/cpuinfo`
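
Putting the pieces together, a minimal sketch that never runs more jobs than there are cores. It assumes bash 4.3 or newer for `wait -n` (wait for any single job to finish); on older shells you could instead call a plain `wait` after each batch:

ncores=$(grep -c ^processor /proc/cpuinfo)    # or simply: nproc
running=0
for file in *.txt; do
        dosomething "$file" "$file.csv" &
        running=$((running + 1))
        if [ "$running" -ge "$ncores" ]; then
                wait -n    # bash 4.3+: returns as soon as any one job exits
                running=$((running - 1))
        fi
done
wait    # make sure the stragglers finish before starting the next step

The final `wait` satisfies the requirement that all files be processed before the next step begins.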

John Moon
  • From [How to Answer](https://stackoverflow.com/help/how-to-answer), see the section "Answer Well-Asked Questions", and therein the bullet point regarding questions which "...have already been asked and answered many times before". – Charles Duffy Apr 13 '18 at 16:20
  • Thank you for your answer. I know the shell really isn't a multithreaded environment, but since my script/task is very simple and basically just emulates a user in an interactive shell, and it is pretty easy to "multithread" these commands when running them manually, I felt it was worth a try! – d-b Apr 13 '18 at 17:43