
I've implemented a way to have concurrent jobs in bash, as seen here.

I'm looping through a file with around 13,000 lines. For now I'm just testing by printing each line, like so:

#!/bin/bash
max_bg_procs(){
    if [[ $# -eq 0 ]] ; then
        echo "Usage: max_bg_procs NUM_PROCS.  Will wait until the number of background (&)"
        echo "           bash processes (as determined by 'jobs -pr') falls below NUM_PROCS"
        return
    fi
    local max_number=$((0 + ${1:-0}))
    while true; do
        local current_number=$(jobs -pr | wc -l)
        if [[ $current_number -lt $max_number ]]; then
                echo "success in if"
                break
        fi
        echo "has to wait"
        sleep 4
    done
}

download_data(){
    echo "link #" $2 "["$1"]"
}

mapfile -t myArray < $1

i=1
for url in "${myArray[@]}"
do
    max_bg_procs 6
    download_data $url $i &
    ((i++))
done
echo "finito!"

I've also tried other solutions such as this and this, but the issue persists:

At a "random" given step, usually between the 2000th and the 5000th iteration, it simply gets stuck. I've put those various echo in the middle of the code to see where it would get stuck but it the last thing it prints is the $url $i.

I've done a simple test of removing all parallelism and just looping over the file contents: everything went fine and it looped all the way to the end.
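
Concretely, the sequential test was the same loop with max_bg_procs and the trailing & removed, roughly:

i=1
for url in "${myArray[@]}"
do
    download_data $url $i
    ((i++))
done
echo "finito!"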

So it makes me think I'm hitting some limitation on the parallelism, and I'd appreciate any help figuring out what it is.

Many thanks!

Miguel
  • http://mywiki.wooledge.org/ProcessManagement is a good place to start. – Charles Duffy Mar 02 '17 at 18:24
  • Why do you have extra double quotes around `$1` in `echo "link #" $2 "["$1"]"`? – Inian Mar 02 '17 at 18:25
  • BTW, there are a ton of (unrelated-to-your-immediate-issue) quoting bugs in this code. Consider running your scripts through http://shellcheck.net/ before posting them here. – Charles Duffy Mar 02 '17 at 18:25
  • in terms of what Inian mentions -- you're quoting exactly the **wrong** things in that code. It's expansions -- like `$2` and `$1` -- that it's most important to quote. (Granted, `#` is also important in this context to prevent it from being treated as a comment character, and quoting `[` and `]` prevents them from being parsed as globs, but `echo "link #${2} [$1]"` would be the Right Thing). – Charles Duffy Mar 02 '17 at 18:27
  • ...but seriously, use `xargs -d $'\n' -P "$max_number"` or (as much as I hate to suggest the huge mess of perl that it is) GNU parallel for this kind of use case. Job control is principally an *interactive* facility, and while it's possible to do this kind of thing robustly in bash, it's a significant pain -- even tools purportedly built for the job, like `wait -n`, have caveats (for instance, if two SIGCHLDs come in at the same time, `wait -n` can return only once even though *two* children exited, meaning you only catch one of them). – Charles Duffy Mar 02 '17 at 18:29
  • ...and btw, instead of adding `echo`s, run your scripts with `bash -x yourscript` if you want to see what's actually going on at runtime. – Charles Duffy Mar 02 '17 at 18:32
  • Thank you very much for the help and comments. The quotation marks are probably a reflex; I'm quite ignorant in bash and have been working with C# in the past years. I was thinking the same way: quote the text, add the variable. I'm sorry for the confusion it might have added. Now I know how it goes. – Miguel Mar 02 '17 at 19:26
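
For reference, a minimal pure-bash throttling sketch along the lines of the `wait -n` approach mentioned in the comments above (this assumes bash 4.3 or newer for `wait -n`, reuses the names from the question's script, and carries the caveat Charles Duffy describes):

#!/usr/bin/env bash
# Sketch only: keep at most 6 jobs in flight using wait -n (bash >= 4.3).
# Caveat from the comments: if two children exit at nearly the same time,
# a single wait -n may account for both, so treat this as best-effort throttling.

download_data(){
    echo "link #${2} [$1]"
}

max_jobs=6
i=1
while IFS= read -r url; do
    while (( $(jobs -pr | wc -l) >= max_jobs )); do
        wait -n    # block until at least one background job exits
    done
    download_data "$url" "$i" &
    ((i++))
done < "$1"

wait    # let the remaining jobs finish
echo "finito!"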

2 Answers


Here, we have up to 6 parallel bash processes calling download_data, each of which is passed up to 16 URLs per invocation. Adjust per your own tuning.

Note that this expects both bash (for exported function support) and GNU xargs.

#!/usr/bin/env bash
#              ^^^^- not /bin/sh

download_data() {
  echo "link #$2 [$1]" # TODO: replace this with a job that actually takes some time
}
export -f download_data
<input.txt xargs -d $'\n' -P 6 -n 16 -- bash -c 'for arg; do download_data "$arg"; done' _
Charles Duffy
  • Thank you very much for all the help. This seems to work perfectly. It hurts me a bit to use code that I don't understand 80% of, but hopefully I'll be able to take some time to read about your solution. Do you think the issue I had previously was related to your comment about `wait`? Once again, thank you very much for your time and helpfulness. – Miguel Mar 02 '17 at 19:29

Using GNU Parallel it looks like this:

cat input.txt | parallel echo link '\#{#} [{}]' 

{#} = the job number
{} = the argument
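
A quick way to see the replacement strings in action (assuming GNU parallel is installed; the two input lines here are just dummies):

printf '%s\n' foo bar | parallel echo 'job {#} got {}'
# job 1 got foo
# job 2 got bar
# (add -k if you need the output kept strictly in input order)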

It will spawn one process per CPU. If you instead want 6 in parallel use -j:

cat input.txt | parallel -j6 echo link '\#{#} [{}]' 

If you prefer running a function:

download_data(){
    echo "link #" $2 "["$1"]"
}
export -f download_data
cat input.txt | parallel -j6 download_data {} {#} 
Ole Tange
  • Maybe edit out the [useless use of `cat`](http://www.iki.fi/era/unix/award.html) or explain how you know it's bad practice but it's a placeholder for something more useful. – tripleee Mar 03 '17 at 08:22
  • Please read http://oletange.blogspot.dk/2013/10/useless-use-of-cat.html and explain why you feel it is bad practice. – Ole Tange Mar 03 '17 at 08:26
  • We have had this discussion before, as I am sure you are able to recall. Including that link is a perfect fix for the immediate problem, which still remains -- perpetrating an antipattern to readers who are otherwise unable to identify it as one. – tripleee Mar 03 '17 at 08:30
  • I want to understand why you still see it as an antipattern. I show in the link that there is no wasted time (unless we are talking high throughput) and an extra process is hardly a problem today. I also give 3 reasons why it is a good idea. I left out the most important reason: Better readability => better maintainability => less human time wasted => less cost. Your link does not address this cost at all. Computing cost used to be so high it was better to sacrifice human time, but that is no longer the case. So, respectfully, can you elaborate why you feel it is bad practice today? – Ole Tange Mar 03 '17 at 11:32
  • There are two fundamental Unix tool design principles here -- commands which operate on contents of files should accept an arbitrary number of file name arguments, and filters should read standard input (and commands which operate on files otherwise turn into filters when invoked without file name arguments). The `cat file | filter` antipattern is obscuring this clean, elegant design to the point where newbies do all kinds of insane things like `more file | wc` because they honestly can't see the pattern. – tripleee Mar 03 '17 at 13:06
  • My students had a hard time grasping the idea of a pipe and would use files all the time, which of course could result in huge performance penalties. I would much prefer them to use a pipe where none need be than to use a temporary file where none need be. Your newbies are apparently different. I am happy that you seem to agree there is no technical advantage of using `<` over `cat |`. I will regard it as a matter of taste. – Ole Tange Mar 03 '17 at 21:03