2

I am new to bash.

I want to wget some resources in parallel.

What is the problem with the following code:

for item in $list
do
  if [ $i -le 10 ];then
    wget -b $item
    let "i++"
  else
    wait
    i=1
  fi

When I execute this script, an error is thrown:

fork: Resource temporarily unavailable

My question is: what is the right way to use wget for this?

Edit:

My problem is that there are about four thousand URLs to download. If I let all these jobs run in parallel, `fork: Resource temporarily unavailable` is thrown. I don't know how to control how many run in parallel.

jasonxia23
  • 2
    `wget -b $item` launches `wget` in the foreground. Use `wget -b $item &` to launch it in the background and allow parallelization. Note that variable `i` looks uninitialized... – Renaud Pacalet Jan 19 '18 at 14:46
  • 3
    Note that the [GNU parallel](https://www.gnu.org/software/parallel/) utility does what you want and much more. – Renaud Pacalet Jan 19 '18 at 14:48
  • 1
    `wget "$item" &` (instead of `wget -b`) is also fine to run wget in background; that way, `bash` knows it's background and you can use `jobs` to see/control. I personally prefer `curl` or `lftp`; I rarely use `wget`. – Bach Lien Jan 21 '18 at 13:12
  • @RenaudPacalet After I spent some time on this, I found I just needed to change `wget -b $item` to `wget $item &`. Yeah, that is what you just said. So what is the difference between `-b` and `&`? – jasonxia23 Jan 21 '18 at 13:43
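
As mentioned in the comments, GNU parallel already handles the job-count limit. A minimal sketch (assuming GNU parallel is installed and a hypothetical urls.txt file with one URL per line) that keeps at most 10 downloads running at a time:

# Sketch only: GNU parallel reads URLs from stdin (here a hypothetical
# urls.txt, one URL per line) and keeps at most 10 wget jobs running.
parallel -j 10 wget -q {} < urls.txt

As for `-b` versus `&`: `wget -b` makes wget detach itself and write its output to wget-log, so the shell can no longer see or wait for it; `wget "$item" &` keeps wget as an ordinary shell background job that `jobs` and `wait` can manage.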

3 Answers

4

Use jobs|grep to check background jobs:

#!/bin/bash

urls=('www.cnn.com' 'www.wikipedia.org')  ## input data

for ((i=-1;++i<${#urls[@]};)); do
  curl -L -s ${urls[$i]} >file-$i.html &  ## background jobs
done

until [[ -z `jobs|grep -E -v 'Done|Terminated'` ]]; do
  sleep 0.05; echo -n '.'                 ## do something while waiting
done

echo; ls -l file*\.html                   ## list downloaded files

Results:

............................
-rw-r--r-- 1 xxx xxx 155421 Jan 20 00:50 file-0.html
-rw-r--r-- 1 xxx xxx  74711 Jan 20 00:50 file-1.html

Another variant, running several tasks in parallel:

#!/bin/bash

urls=('www.yahoo.com' 'www.hotmail.com' 'stackoverflow.com')

_task1(){                                  ## task 1: download files
  for ((i=-1;++i<${#urls[@]};)); do
    curl -L -s ${urls[$i]} >file-$i.html &
  done; wait
}
_task2(){ echo hello; }                    ## task 2: a fake task
_task3(){ echo hi; }                       ## task 3: a fake task

_task1 & _task2 & _task3 &                 ## run them in parallel
wait                                       ## and wait for them

ls -l file*\.html                          ## list results of all tasks
echo done                                  ## and do something

Results:

hello
hi
-rw-r--r-- 1 xxx xxx 320013 Jan 20 02:19 file-0.html
-rw-r--r-- 1 xxx xxx   3566 Jan 20 02:19 file-1.html
-rw-r--r-- 1 xxx xxx 253348 Jan 20 02:19 file-2.html
done

Example with a limit on how many downloads run in parallel at a time (max=3):

#!/bin/bash

m=3                                            ## max jobs (downloads) at a time
t=4                                            ## retries for each download

_debug(){                                      ## list jobs to see (debug)
  printf ":: jobs running: %s\n" "$(echo `jobs -p`)"
}

## sample input data
## is redirected to filehandle=3
exec 3<<-EOF
www.google.com google.html
www.hotmail.com hotmail.html
www.wikipedia.org wiki.html
www.cisco.com cisco.html
www.cnn.com cnn.html
www.yahoo.com yahoo.html
EOF

## read data from filehandle=3, line by line
while IFS=' ' read -u 3 -r u f || [[ -n "$f" ]]; do
  [[ -z "$f" ]] && continue                  ## ignore empty input line
  while [[ $(jobs -p|wc -l) -ge "$m" ]]; do  ## while $m or more jobs are running
    _debug                                   ## then list jobs to see (debug)
    wait -n                                  ## and wait for some job(s) to finish
  done
  curl --retry $t -Ls "$u" >"$f" &           ## download in background
  printf "job %d: %s => %s\n" $! "$u" "$f"   ## print job info to see (debug)
done

_debug; wait; ls -l *\.html                  ## see final results

Outputs:

job 22992: www.google.com => google.html
job 22996: www.hotmail.com => hotmail.html
job 23000: www.wikipedia.org => wiki.html
:: jobs running: 22992 22996 23000
job 23022: www.cisco.com => cisco.html
:: jobs running: 22996 23000 23022
job 23034: www.cnn.com => cnn.html
:: jobs running: 23000 23022 23034
job 23052: www.yahoo.com => yahoo.html
:: jobs running: 23000 23034 23052
-rw-r--r-- 1 xxx xxx  61473 Jan 21 01:15 cisco.html
-rw-r--r-- 1 xxx xxx 155055 Jan 21 01:15 cnn.html
-rw-r--r-- 1 xxx xxx  12514 Jan 21 01:15 google.html
-rw-r--r-- 1 xxx xxx   3566 Jan 21 01:15 hotmail.html
-rw-r--r-- 1 xxx xxx  74711 Jan 21 01:15 wiki.html
-rw-r--r-- 1 xxx xxx 319967 Jan 21 01:15 yahoo.html

After reading your updated question, I think it is much easier to use lftp, which can log and download (automatically following links, retrying downloads, and continuing downloads); you'll never need to worry about job/fork resources because you run only a few lftp commands. Just split your download list into a few smaller lists, and lftp will download them for you:

$ cat downthemall.sh 
#!/bin/bash

## run: lftp -c 'help get'
## to know how to use lftp to download files
## with automatically retry+continue

p=()                                     ## pid list

for l in *\.lst; do
  lftp -f "$l" >/dev/null &              ## run processes in parallel
  p+=("--pid=$!")                        ## record pid
done

until [[ -f d.log ]]; do sleep 0.5; done ## wait for the log file
tail -f d.log ${p[@]}                    ## print results when downloading

Outputs:

$ cat 1.lst 
set xfer:log true
set xfer:log-file d.log
get -c http://www.microsoft.com -o micro.html
get -c http://www.cisco.com     -o cisco.html
get -c http://www.wikipedia.org -o wiki.html

$ cat 2.lst 
set xfer:log true
set xfer:log-file d.log
get -c http://www.google.com    -o google.html
get -c http://www.cnn.com       -o cnn.html
get -c http://www.yahoo.com     -o yahoo.html

$ cat 3.lst 
set xfer:log true
set xfer:log-file d.log
get -c http://www.hp.com        -o hp.html
get -c http://www.ibm.com       -o ibm.html
get -c http://stackoverflow.com -o stack.html

$  rm *log *html;./downthemall.sh
2018-01-22 02:10:13 http://www.google.com.vn/?gfe_rd=cr&dcr=0&ei=leVkWqiOKfLs8AeBvqBA -> /tmp/1/google.html 0-12538 103.1 KiB/s
2018-01-22 02:10:13 http://edition.cnn.com/ -> /tmp/1/cnn.html 0-153601 362.6 KiB/s
2018-01-22 02:10:13 https://www.microsoft.com/vi-vn/ -> /tmp/1/micro.html 0-129791 204.0 KiB/s
2018-01-22 02:10:14 https://www.cisco.com/ -> /tmp/1/cisco.html 0-61473 328.0 KiB/s
2018-01-22 02:10:14 http://www8.hp.com/vn/en/home.html -> /tmp/1/hp.html 0-73136 92.2 KiB/s
2018-01-22 02:10:14 https://www.ibm.com/us-en/ -> /tmp/1/ibm.html 0-32700 131.4 KiB/s
2018-01-22 02:10:15 https://vn.yahoo.com/?p=us -> /tmp/1/yahoo.html 0-318657 208.4 KiB/s
2018-01-22 02:10:15 https://www.wikipedia.org/ -> /tmp/1/wiki.html 0-74711 60.7 KiB/s
2018-01-22 02:10:16 https://stackoverflow.com/ -> /tmp/1/stack.html 0-253033 180.8
Bach Lien
  • It's better: https://stackoverflow.com/questions/1131484/wait-for-bash-background-jobs-in-script-to-be-finished – Bach Lien Jan 19 '18 at 17:57
  • thanks for your answer, my problem is there are about four thousand URLs to download; if I let all these jobs run in parallel, `fork: Resource temporarily unavailable` is thrown. I don't know how to control the count in parallel. – jasonxia23 Jan 20 '18 at 15:00
  • if you want 10 downloads in parallel, then run 10 tasks in parallel, each task downloading a list of files sequentially; however, I recommend using a full-featured downloader, not a bash script, if you really need to download that many files – Bach Lien Jan 20 '18 at 15:06
  • how can I achieve this, sorry, I am really new to shell scripting. – jasonxia23 Jan 20 '18 at 15:12
  • 1
    I've added another example, where at most 3 files are downloaded concurrently. – Bach Lien Jan 20 '18 at 16:15
  • Good description and variants. – iamauser Jan 20 '18 at 22:36
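
For illustration, a minimal bash sketch of the approach described in the comments above (N parallel tasks, each downloading its own sublist sequentially), assuming GNU split and a hypothetical urls.txt with one URL per line:

#!/bin/bash
# Sketch only: split the hypothetical urls.txt into 10 chunks and download
# each chunk sequentially, with the 10 chunk tasks running in parallel.
split -n l/10 urls.txt chunk_           ## GNU split: 10 chunks, whole lines

for f in chunk_*; do
  (                                     ## one background task per chunk
    while IFS= read -r url; do
      [[ -n "$url" ]] && wget -q "$url" ## download this chunk's URLs one by one
    done < "$f"
  ) &
done
wait                                    ## wait for all chunk tasks to finish
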
2

With the updated question, here is an updated answer.

The following script launches 10 (can be changed to any number) wget processes in the background and monitors them. Once one of the processes finishes, it picks up the next URL in the list and tries to keep the same $maxn (10) processes running in the background, until it runs out of URLs from the list ($urlfile). There are inline comments to help you understand it.

$ cat wget.sh
#!/bin/bash

wget_bg()
{
    > ./wget.pids # Start with empty pidfile
    urlfile="$1"
    maxn=$2
    cnt=0;
    while read -r url
    do
        if [ $cnt -lt $maxn ] && [ ! -z "$url" ]; then # Only maxn processes will run in the background
            echo -n "wget $url ..."
            wget "$url" &>/dev/null &
            pidwget=$! # This gets the backgrounded pid
            echo "$pidwget" >> ./wget.pids # fill pidfile
            echo "pid[$pidwget]"
            ((cnt++));
        fi
        while [ $cnt -eq $maxn ] # Start monitoring as soon as the count reaches maxn
        do
            while read -r pids
            do
                if ps -p $pids > /dev/null; then # Check whether pid is still running
                  :
                else
                  sed -i "/^$pids$/d" wget.pids # If not, remove its line from pidfile
                  ((cnt--)); # decrement counter
                fi
            done < wget.pids
            sleep 1 # brief pause so the monitoring loop doesn't busy-wait
        done
    done < "$urlfile"
}    
# This runs 10 wget processes at a time in the bg. Modify for more or less.
wget_bg ./test.txt 10 

To run:

$ chmod u+x ./wget.sh 
$ ./wget.sh
wget blah.com ...pid[13012]
wget whatever.com ...pid[13013]
wget thing.com ...pid[13014]
wget foo.com ...pid[13015]
wget bar.com ...pid[13016]
wget baz.com ...pid[13017]
wget steve.com ...pid[13018]
wget kendal.com ...pid[13019]
iamauser
  • I think OP wants to download things in parallel. – Bach Lien Jan 19 '18 at 18:14
  • Not sure why you have to wait if you want all the processes to run in parallel for different urls. – iamauser Jan 19 '18 at 18:17
  • I think OP wants to do something while waiting for all files to be downloaded, not wait and do nothing; then he/she would handle the files after all of them are downloaded. Downloading in parallel will be faster, especially when downloading from different sources. – Bach Lien Jan 19 '18 at 18:19
  • Ideally, files should be checked to see whether they downloaded successfully; but that is another matter, not a parallelism technique. I think this question is about parallelism, so downloading files one by one sequentially is not really the answer. – Bach Lien Jan 19 '18 at 18:28
  • I believe OP's question and example need a little more clarity than they have now. – iamauser Jan 19 '18 at 18:31
  • @iamauser sorry for the ambiguity. I have updated my question. – jasonxia23 Jan 20 '18 at 15:04
  • @jasonxia23 see my updated answer, hope it works for you. – iamauser Jan 20 '18 at 22:23
-2

Add this in your if statement:

until wget -b "$item"; do
    printf '.'
    sleep 2
done

The loop will wait until the process has finished, printing a "." every 2 seconds.

Léo R.