
I want to read links from a file, which is passed as an argument, and download the content from each one. How can I do it in parallel with 20 processes? I understand how to do it with an unlimited number of processes:

#!/bin/bash

filename="$1"
mkdir -p saved

while read -r line; do
    url="$line"
    # Name the saved file after the SHA-256 hash of the URL
    name_download_file_sha="$(echo "$url" | sha256sum | awk '{print $1}').jpeg"
    curl -L "$url" > "saved/$name_download_file_sha" &

done < "$filename"
wait
RazDva
  • The shell does not expose threading at all. You can easily use parallel _processes_ by backgrounding each job with `&` like you are doing here. You are apparently actually asking how to limit the number of concurrent processes. I have updated your question accordingly. – tripleee Aug 11 '21 at 18:19
  • The question and the code do not match. What exactly is the issue? Each curl command runs in the background and is hence already kind of parallel. Do you want to read 20 files instead of 1? Or do you want to run 20 curls (but then why)? – Aval Sarri Aug 11 '21 at 18:26

2 Answers


You can add this test:

    until [ "$( jobs -lr 2>&1 | wc -l)"  -lt 20 ]; do
        sleep 1
    done

This keeps at most 20 instances of curl running in parallel: the loop waits until the number of running jobs drops to 19 or fewer before starting another one.

If you are using GNU sleep, you can use sleep 0.5 to reduce the wait time between checks.

So your code will be:

#!/bin/bash

filename="$1"
mkdir -p saved

while read -r line; do
    # Wait until fewer than 20 background download jobs are running
    until [ "$( jobs -lr 2>&1 | wc -l)" -lt 20 ]; do
        sleep 1
    done
    url="$line"
    name_download_file_sha="$(echo "$url" | sha256sum | awk '{print $1}').jpeg"
    curl -L "$url" > "saved/$name_download_file_sha" &

done < "$filename"
wait
EchoMike444
  • Due to the sleep it is possible (although unlikely) that this parallelized script takes longer than running all jobs sequentially. You can replace `until` by `if` and use `wait -n` instead. – Socowi Aug 12 '21 at 06:32
  • @Socowi I will take a look at your suggestion – EchoMike444 Aug 12 '21 at 15:13
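
A minimal sketch of the `wait -n` variant suggested in the comment above (assuming bash 4.3 or newer, where `wait -n` blocks until any single background job finishes), replacing the original loop:

    while read -r line; do
        # If 20 curl jobs are already running, block until one of them exits
        if [ "$(jobs -lr 2>&1 | wc -l)" -ge 20 ]; then
            wait -n
        fi
        url="$line"
        name_download_file_sha="$(echo "$url" | sha256sum | awk '{print $1}').jpeg"
        curl -L "$url" > "saved/$name_download_file_sha" &
    done < "$filename"
    wait

Because there is no sleep, the loop never waits longer than it takes for one download to finish.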

xargs -P is the simple solution. It gets somewhat more complicated when you want to save to separate files, but you can use sh -c to add this bit.

: "${processes:=20}"
< "$filename" xargs -P "$processes" -I% sh -c '
    # xargs passes each input line to this inline script as $1 (via the trailing "-- %")
    line="$1"
    url_file="$line"
    name_download_file_sha="$(echo "$url_file" | sha256sum | awk "{print \$1}").jpeg"
    curl -L "$url_file" > "saved/$name_download_file_sha"
' -- %
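
To run it with a different level of parallelism, you can override the default via the environment, for example (assuming the snippet is combined with the filename="$1" and mkdir -p saved lines from the question; download.sh and links.txt are placeholder names used only for illustration):

    processes=10 ./download.sh links.txt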

Based on tripleee's suggestions, I've lower-cased the environment variable and changed its name to 'processes' to be more correct.

I've also made the suggested corrections to the awk script to avoid quoting issues.

You may still find it easier to replace the awk script with cut -f1, but you'll need to specify the cut delimiter since it's spaces (not tabs).
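
For example, a sketch of that replacement inside the sh -c script (sha256sum separates the hash from the file name with spaces, so the delimiter has to be a space):

    name_download_file_sha="$(echo "$url_file" | sha256sum | cut -d" " -f1).jpeg"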

Mark
  • You can trivially switch to double quotes around the Awk script; just backslash-escape any backslashes, double quotes, (backticks,) and dollar signs. – tripleee Aug 11 '21 at 18:26
  • Those are processes, not threads. Probably avoid uppercase for your private variables; see also https://stackoverflow.com/questions/673055/correct-bash-and-shell-script-variable-capitalization – tripleee Aug 11 '21 at 18:28
  • @tripleee: Applied your suggestions. – Mark Aug 11 '21 at 18:36