3

I am using the logic below to download 3 files from the array at once; only once all 3 have completed will the next 3 files be picked up.

parallel=3

downLoad() {
    while (( "$#" )); do
        for (( i=0; i<parallel; i++ )); do
            (( "$#" )) || break     # fewer than $parallel files left in the queue
            echo "downloading ${1}..."
            curl -s -o "${1}.tar.gz" <download_URL> &
            shift
        done
        wait
        echo "#################################"
    done
}

downLoad "${layers[@]}"

But what I expect is that at any point in time 3 downloads are running. I mean, suppose we send 3 file downloads to the background and one of the 3 completes very soon because of its small size; I want another file from the queue to be sent for download right away.

COMPLETE SCRIPT:

#!/bin/bash

set -eu

reg="registry.hub.docker.com"
repo="hjd48"
image="redhat"
name="${repo}/${image}"
tag="latest"
parallel=3

# Get auth token
token=$( curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:${name}:pull" | jq -r .token )

# Get layers
resp=$(curl -s -H "Authorization: Bearer $token" "https://${reg}/v2/${name}/manifests/${tag}" | jq -r .fsLayers[].blobSum )
layers=( $( echo $resp | tr ' ' '\n' | sort -u ) )

prun() {
    PIDS=()
    while (( "$#" )); do
        if ( kill -0 ${PIDS[@]} 2>/dev/null ; [[ $(( ${#PIDS[@]} - $? )) -lt $parallel ]])
        then
                echo "Download: ${1}.tar.gz"
                curl -s -o $1.tar.gz -L -H "Authorization: Bearer $token" "https://${reg}/v2/${name}/blobs/${1}" &
                PIDS+=($!)
                shift
        fi
    done
    wait
}

prun ${layers[@]}
Rohith
    I think you should gather the child PIDs of the forked processes and keep checking (in a loop) whether they are done yet. You can use the $! variable, gather them, and use them to check the processes. See also https://stackoverflow.com/questions/9117507/linux-unix-command-to-determine-if-process-is-running – twicejr Mar 15 '18 at 09:54

4 Answers

2

If you do not mind using xargs then you can:

xargs -I xxx -P 3 sleep xxx < sleep

and the file sleep contains:

1
2
3
4
5
6
7
8
9

and if you watch the background processes with:

watch -n 1 --exec ps --forest -g -p your-Bash-pid

(the sleep file could be your array of links) then you will see that 3 jobs run in parallel, and when one of the three completes the next job is added. In fact, 3 jobs are always running until the end of the array.

sample output of watch(1):

12260 pts/3    S+     0:00  \_ xargs -I xxx -P 3 sleep xxx
12263 pts/3    S+     0:00      \_ sleep 1
12265 pts/3    S+     0:00      \_ sleep 2
12267 pts/3    S+     0:00      \_ sleep 3

xargs starts with 3 jobs and when one of them finishes it adds the next, which becomes:

12260 pts/3    S+     0:00  \_ xargs -I xxx -P 3 sleep xxx
12265 pts/3    S+     0:00      \_ sleep 2
12267 pts/3    S+     0:00      \_ sleep 3
12269 pts/3    S+     0:00      \_ sleep 4 # this one was added
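
Applied to the layer downloads from the question, the same pattern might look like the sketch below (untested; it reuses $token, $reg, $name and the layers array from the question's script):

printf '%s\n' "${layers[@]}" |
    xargs -I {} -P 3 \
        curl -s -L -H "Authorization: Bearer $token" \
             -o {}.tar.gz "https://${reg}/v2/${name}/blobs/{}"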
Shakiba Moshiri
  • The --verbose option is exposing my username and password/token too! Can't I get control over this? xargs -I nnn -P 3 --verbose curl -s -o nnn.tar.gz -u$user:$token https://docker.repositories.sap.ondemand.com/nnn < downloadURLS – Rohith Mar 16 '18 at 06:32
  • @Rohith **xargs** using `-t` or `--verbose` sends the command and its arguments to standard error (fd 2). Why do you want to use `--verbose`? I am not very familiar with `curl`, but I think `curl` exposes them, not `xargs` – Shakiba Moshiri Mar 16 '18 at 07:25
1

I've done just this by using trap to handle SIGCHLD and start another transfer when one ends.

The difficult part is that once your script installs a SIGCHLD handler with that trap line, you can't create any child processes other than your transfer processes. For example, if your shell doesn't have a built-in echo, calling echo would spawn a child process that would cause you to start one more transfer when the echo process ends.

I don't have a copy available, but it was something like this:

startDownload() {
   # only start another download if there are URLs left in
   # the array that haven't been downloaded yet
   if [ "${urls[$fileno]}" ]; then
       # start a curl download in the background and increment fileno
       # so the next call downloads the next URL in the array
       curl ... "${urls[$fileno]}" &
       fileno=$((fileno+1))
   fi
}

trap startDownload SIGCHLD

# start at file zero and set up an array
# of URLs to download
fileno=0
urls=...

parallel=3

# start the initial parallel downloads
# when one ends, the SIGCHLD will cause
# another one to be started if there are
# remaining URLs in the array
for (( i=0; i<$parallel; i++ )); do
    startDownload
done

wait

That's not been tested at all, and probably has all kinds of errors.

Andrew Henle
  • It's not downloading all the files in the array! It only downloads 3 and exits! :( – Rohith Mar 15 '18 at 11:40
  • @Rohith I said it wasn't tested. ;-) You probably need to check what the `if` line in `startDownload` is doing - if your shell has a built-in `echo`, you can put `echo` statements in the function. What's probably happening is related to this: https://stackoverflow.com/questions/13219634/easiest-way-to-check-for-an-index-or-a-key-in-an-array – Andrew Henle Mar 15 '18 at 11:56
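
A related way to get the same rolling behaviour without a SIGCHLD trap, assuming bash 4.3 or later where wait -n is available, is to block until any one background job exits before starting the next. A minimal, untested sketch (reusing the urls array idea from this answer, with a placeholder curl command):

parallel=3
running=0

for url in "${urls[@]}"; do
    if [ "$running" -ge "$parallel" ]; then
        wait -n                     # block until any one download finishes
        running=$((running - 1))
    fi
    curl -s -O "$url" &             # placeholder download command
    running=$((running + 1))
done
wait                                # wait for whatever is still running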
0

I would read all provided filenames into three variables, and then process each stream separately, e.g.

PARALLEL=3
COUNTER=1
for FILENAME in "$@"
do
        eval "FILESTREAM${COUNTER}=\"\$FILESTREAM${COUNTER} \$FILENAME\""
        COUNTER=`expr ${COUNTER} + 1`

        if [ ${COUNTER} -gt ${PARALLEL} ]
        then
                COUNTER=1
        fi
done

and now call the download function for each of the streams in parallel:

COUNTER=1

while [ ${COUNTER} -le ${PARALLEL} ]
do
        eval "download \$FILESTREAM${COUNTER} &"
        COUNTER=`expr ${COUNTER} + 1`
done
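
This assumes a download function that works through its arguments one after another; a minimal sketch of such a worker (with <download_URL> kept as a placeholder, as in the question) could be:

download() {
        for FILENAME in "$@"
        do
                echo "downloading ${FILENAME}..."
                curl -s -o "${FILENAME}.tar.gz" <download_URL>
        done
}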
0

Besides implementing a parallel bash script from scratch, GNU parallel is a readily available tool which is quite suitable for this kind of task.

parallel -j3 curl -s -o {}.tar.gz download_url ::: "${layers[@]}"
  • -j3 ensures a maximum of 3 jobs running at the same time
  • you can add the --dry-run option right after parallel to make sure the built command is exactly what you want; a filled-in sketch follows below
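
Applied to the registry blobs from the question, the invocation could look roughly like this sketch (reusing $token, $reg and $name from the question's script):

parallel -j3 curl -s -L -H "Authorization: Bearer $token" \
    -o {}.tar.gz "https://${reg}/v2/${name}/blobs/{}" ::: "${layers[@]}"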
etopylight