
So guys,

I need your help identifying the fastest and most fault-tolerant solution to my problem. I have a shell script which executes some functions based on a txt file, in which I have a list of files. The list can contain anywhere from 1 to X files. What I would like to do is iterate over the content of the file and execute my scripts for only 4 items from the file at a time. Once the functions have been executed for those 4 files, move on to the next 4, and keep doing so until all the files from the list have been processed.

My code so far is as follows.

#!/bin/bash

number_of_files_in_folder=$(cat list.txt | wc -l)
max_number_of_files_to_process=4
Translated_files=/home/german_translated_files/

while IFS= read -r files
do  
        while [[ $number_of_files_in_folder -gt 0 ]]; do
            i=1
            while [[ $i -le $max_number_of_files_to_process ]]; do
                my_first_function "$files" &                                                  # I execute my translation function for each file, as it can only perform 1 file per execution 
                find /home/german_translator/ -name '*.logs' -exec mv {} $Translated_files \; # As there will be several files generated, I have them copied to another folder
                sed -i "/$files/d" list.txt                                                   # We remove the processed file from within our list.txt file.
                my_second_function                                                            # Without parameters as it will process all the files copied at step 2.
            done
            # here, I want to have all the files processed and don't stop after the first iteration
        done
done < list.txt

Unfortunately, as I am not that good at shell scripting, I do not know how to structure it so that it won't waste any resources and, most importantly, so that it processes everything from that file. Do you have any advice on how to achieve this?

Paul C.
  • Why do you need `number_of_files_in_folder` at all? Can't you just iterate over the files without needing to have a count? – Charles Duffy Aug 23 '22 at 17:35
  • Also, are you _very_ sure it's okay for `my_first_function` to run in the background? That means you don't know when it's finished -- it can still be running when `my_second_function` starts, or the `sed` can run before `my_first_function` has finished initializing and starts its actual logic. – Charles Duffy Aug 23 '22 at 17:36
  • ...also, _in general_, using `sed` to edit a file _while you're reading from that file_ is a bad idea unless it's something you have an extremely good reason to do; you can process the file in 4-line chunks without needing to remove those lines as you go. – Charles Duffy Aug 23 '22 at 17:36
  • ...an important thing: The ` – Charles Duffy Aug 23 '22 at 17:38
  • BTW, I'm curious -- why batches of four? Is this something like a CPU resource constraint issue? (if so, `xargs -d $'\n' -P 4` might be useful for you, to keep four processes running at any given time). – Charles Duffy Aug 23 '22 at 17:39
  • ...anyhow -- I've added a bunch of comments rather than an actual answer because it's hard to tell what the narrow, specific technical issue your question is about actually is. Clarifying that would probably be helpful for you to get answers that concretely address your problem. – Charles Duffy Aug 23 '22 at 17:42
  • ...btw, `while IFS= read -r line1 && IFS= read -r line2 && IFS= read -r line3 && IFS= read -r line4` will read four lines per iteration of your loop (though if the total number of lines isn't divisible by four you can have some items not processed, so I don't recommend it for your use case; really, `xargs -P 4` is the best choice unless the items need to be processed in a way that refers to the other items sharing the same batch). – Charles Duffy Aug 23 '22 at 17:51
  • Hi @CharlesDuffy, thank you for your answer. To narrow everything down, let me try to answer each comment: comment 1 -> I created that variable because I was not sure I could iterate over the file in chunks of 4. comment 2 -> No, I am not very sure it is okay, but I had no other idea. It is best to wait until my_first_function is done and only then proceed with the other steps. comment 3 -> I was using sed because I did not know how to iterate in chunks of 4, and my only option was to remove those lines and keep going until I had no other files in the list. – Paul C. Aug 23 '22 at 17:58
  • comment 4 -> well, you are right. But it was my only idea for processing all those files in chunks of 4. comment 5 -> batches of 4 because if I execute the functions on more than 4 files, I will run out of resources, so I need to make sure that we have a max of 4 files with each iteration. Short version: I do not really know how to iterate over the content of list.txt in chunks of 4 (or less) and keep doing that until the list has been read and processed completely. – Paul C. Aug 23 '22 at 17:59
  • Why do you process 4 lines at a time instead of line by line? How is each line structured? You might have some luck using `paste` to first merge 4 consecutive lines, then process this one line – knittl Aug 23 '22 at 18:06
  • Okay, that being the case, what you should do is `xargs -d $'\n' -P 4 /path/to/command-to-process-one-file` – Charles Duffy Aug 23 '22 at 18:15
  • See [parallelize bash script with maximum number of processes](https://stackoverflow.com/questions/38160/parallelize-bash-script-with-maximum-number-of-processes). (I recommend xargs over GNU parallel because xargs is the simpler tool -- it doesn't try to be as [offensively clever](https://lists.gnu.org/archive/html/bug-parallel/2015-05/msg00005.html), so its behavior is simpler and more predictable; I also admit to a personal prejudice against anything written in perl). – Charles Duffy Aug 23 '22 at 18:17
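
As a side note on the "read four lines per iteration" loop described in the comments above, here is a rough, illustrative sketch that also handles a trailing batch of fewer than four files; the `echo` is only a placeholder for the real per-batch work, and it assumes list.txt has no blank lines (nothing here is from the original post):

#!/bin/bash

# Read up to four file names per iteration; a short final batch is still handled.
while IFS= read -r f1; do
    IFS= read -r f2
    IFS= read -r f3
    IFS= read -r f4

    batch=("$f1")
    [[ -n $f2 ]] && batch+=("$f2")
    [[ -n $f3 ]] && batch+=("$f3")
    [[ -n $f4 ]] && batch+=("$f4")

    echo "processing batch: ${batch[*]}"    # placeholder for the real work
done < list.txt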

2 Answers


only 4 items out of the file. Once the functions have been executed for these 4 files, go over to the next 4

Seems to be quite easy with xargs.

your_function() {
   echo "Do something with $1 $2 $3 $4"
}
export -f your_function

xargs -d '\n' -n 4 bash -c 'your_function "$@"' _ < list.txt
  • xargs -d '\n' - split the input on newlines, so each line of list.txt becomes one argument
  • -n 4 - pass at most four arguments to each invocation of the command
  • bash -c 'your_function "$@"' _ - the command that is run with those (up to) four arguments
  • _ - the syntax is bash -c <script> <name> <args...>; the word after the script becomes $0, so _ is just a placeholder, see man bash
  • "$@" - forwards the arguments received from xargs to your_function
  • export -f your_function - export your function to the environment so the child bash can pick it up
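
For illustration, a quick test run with a small, made-up list.txt (the file names are hypothetical):

printf '%s\n' fileA.txt fileB.txt fileC.txt fileD.txt fileE.txt > list.txt

your_function() {
   echo "Do something with $1 $2 $3 $4"
}
export -f your_function

xargs -d '\n' -n 4 bash -c 'your_function "$@"' _ < list.txt
# Do something with fileA.txt fileB.txt fileC.txt fileD.txt
# Do something with fileE.txt

Note that when the number of lines is not a multiple of four, the last invocation receives fewer than four arguments (the remaining parameters are simply empty).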

I execute my translation function for each file

So you execute your translation function for each file, not for each batch of 4 files. If the "translation function" really operates on one file at a time with no inter-file state, consider instead running 4 processes in parallel with the same code, using just xargs -P 4.
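
A rough sketch of that variant, assuming my_first_function takes exactly one file name and my_second_function should run once after everything is translated (both assumptions taken from the question's description):

export -f my_first_function

# One file per invocation, at most 4 invocations running at any given time.
xargs -d '\n' -n 1 -P 4 bash -c 'my_first_function "$1"' _ < list.txt

# xargs only returns once all workers have exited, so this runs last.
my_second_function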

KamilCuk

If you have GNU Parallel it looks something like this:

doit() {
    my_first_function "$1"
    my_first_function "$2"
    my_first_function "$3"
    my_first_function "$4"
    my_second_function "$1" "$2" "$3" "$4"
}
export -f doit

cat list.txt | parallel -n4 doit
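
One caveat worth noting: if the number of lines in list.txt is not a multiple of four, the last doit call receives fewer than four arguments, so the unset $2 to $4 expand to empty strings and my_first_function gets called with empty arguments. A variant that only loops over the arguments actually passed (still assuming my_second_function accepts a variable number of file names):

doit() {
    local f
    for f in "$@"; do            # only the arguments this batch actually received
        my_first_function "$f"
    done
    my_second_function "$@"
}
export -f doit

cat list.txt | parallel -n4 doit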

Ole Tange