
I am trying to download files from a file (test.txt) that contains links (over 15,000 of them).

I have this script:

#!/bin/bash

function download {

FILE=$1

while read line; do
        url=$line

        wget -nc -P ./images/ $url

        #downloading images which are not in the test.txt, 
        #by guessing name: 12345_001.jpg, 12345_002.jpg..12345_005.jpg etc.

        wget -nc  -P ./images/ ${url%.jpg}_{001..005}.jpg
done < $FILE

}  

#test.txt contains the URLs
split -l 1000 ./temp/test.txt ./temp/split

#read the split files and pass each one to the download function
for f in ./temp/split*; do
    download $f &
done

test.txt:

http://xy.com/12345.jpg
http://xy.com/33442.jpg
...

I split the file into several pieces and background the download calls (download $f &) so that wget can work on the different split files in parallel.

The script works, but it does not seem to exit at the end; I have to press Enter to get my prompt back. If I remove the & from download $f & it exits cleanly, but I lose the parallel downloading.
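For reference, here is a minimal sketch of the same idea with an explicit wait at the end, so the parent script only returns once every background job has finished (not my original script; the -q flag is only there to keep wget quiet):

#!/bin/bash

download () {
    # download each URL in the given chunk, plus the guessed _001.._005 variants
    while read -r url; do
        wget -ncq -P ./images/ "$url"
        wget -ncq -P ./images/ "${url%.jpg}"_{001..005}.jpg
    done < "$1"
}

split -l 1000 ./temp/test.txt ./temp/split

for f in ./temp/split*; do
    download "$f" &
done

wait    # block here until all backgrounded download jobs are done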

Edit:

As I have since found out, this is not the best way to parallelize wget downloads; it would be better to use GNU Parallel.


Adrian
  • What is `kill $$` for? – Barmar Feb 06 '18 at 19:43
  • What do you mean it doesn't exit at the end? What does it do when it's all done? – Barmar Feb 06 '18 at 19:45
  • kill $$ # Script kills its own process here!? It should exit from this bash script, back to the prompt. – Adrian Feb 06 '18 at 19:45
  • That's not needed at the end of a function or script, since a script exits automatically when it reaches the end. – Barmar Feb 06 '18 at 19:46
  • Note that `$$` inside the function is the original script's PID, not the PID of the background process running the function. So the function will kill the main script. – Barmar Feb 06 '18 at 19:47
  • Thanks, I removed $$ from the question. – Adrian Feb 06 '18 at 19:49
  • I think you're mistaken that it's not exiting. Maybe it's printing something in the background, and that's getting printed after the shell prompt, so you have to press enter to get another prompt. – Barmar Feb 06 '18 at 19:54
  • What happens if you just type `echo foo` instead of pressing enter? Does it execute the command? – Barmar Feb 06 '18 at 19:55
  • It is sitting here: https://imgur.com/MnAdgBh – Adrian Feb 06 '18 at 19:56
  • echo foo prints foo hmmm... – Adrian Feb 06 '18 at 19:56
  • I was right, it's displaying progress messages and other results in the background, those are after the prompt. – Barmar Feb 06 '18 at 19:57
  • I think you are right, simply turning on wget's quiet mode helped! – Adrian Feb 06 '18 at 20:00
  • IMHO you would be MILES BETTER OFF using **GNU Parallel**... https://stackoverflow.com/a/45218013/2836621 It gives you progress bars too. – Mark Setchell Feb 06 '18 at 20:17
  • @MarkSetchell This is a good idea, I've been thinking about this. But I don't know how these two lines could be used with Parallel: "wget -nc -P ./images/ $url" and "wget -nc -P ./images/${url%.jpg}_{001..005}.jpg" - especially the second one. Do you have any idea or a quick example? – Adrian Feb 06 '18 at 20:28
  • Maybe... but I can't see what your input data looks like. Please click `edit` under your question and add a representative sample. – Mark Setchell Feb 06 '18 at 20:34

4 Answers


The script is exiting, but the wget processes in the background are producing output after the script exits, and this gets printed after the shell prompt. So you need to press Enter to get another prompt.

Use the -q option to wget to turn off output.

while read line; do
        url=$line
        wget -ncq -P ./images/ "$url"
        wget -ncq  -P ./images/ "${url%.jpg}"_{001..005}.jpg
done < "$FILE"
Barmar
  • Or, you could also use the `-b` option to send the process to the background for execution. In which case, all the output is automatically sent to a log file. – darnir Feb 07 '18 at 12:46
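A rough sketch of that -b suggestion, substituted into the download loop from the answer above (download.log is just an assumed log file name):

wget -b -a ./images/download.log -nc -P ./images/ "$url"

With -b each wget detaches immediately and its messages are appended to the log file given with -a rather than printed to the terminal, so nothing ends up after the shell prompt.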

May I commend GNU Parallel to you?

parallel --dry-run -j32 -a URLs.txt 'wget -ncq -P ./images/ {}; wget -ncq  -P ./images/ {.}_{001..005}.jpg'

I am only guessing what your input file looks like in URLs.txt as something resembling:

http://somesite.com/image1.jpg
http://someothersite.com/someotherimage.jpg

Or, using your own approach with a function:

#!/bin/bash

# define and export a function for "parallel" to call
doit(){
   wget -ncq -P ./images/ "$1"
   wget -ncq -P ./images/ "$2_{001..005}.jpg"
}
export -f doit

parallel --dry-run  -j32 -a URLs.txt doit {} {.}
Mark Setchell
  • URLs.txt contains links to images: xy.com/12345.jpg, xy.com/13346.jpg. The first wget downloads these files; with the second wget command I try to download images that are not in the URL list: 12345_001.jpg..12345_005.jpg and 13346_001.jpg..13346_005.jpg. – Adrian Feb 06 '18 at 20:47
  • Please click `edit` under your question and make sure it correctly shows all the information - it is really hard to read unformatted stuff in the comments area. – Mark Setchell Feb 06 '18 at 20:51

@Barmar's answer is correct. However, I would like to present a different, more efficient solution. You could look into using Wget2.

Wget2 is the next major version of GNU Wget. It comes with many new features, including multi-threaded downloading. So, with GNU Wget2, all you need to do is pass the --max-threads option and choose the number of parallel threads you want to spawn.
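For example, a hypothetical invocation might look like this (assuming Wget2 keeps wget's -nc, -P and -i options, which its documentation suggests it does):

# fetch every URL listed in test.txt with up to 16 parallel download threads
wget2 --max-threads=16 -nc -P ./images/ -i ./temp/test.txt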

You can compile it from the git repository very easily. There are also packages for Arch Linux on the AUR and in Debian.

EDIT: Full Disclosure: I am one of the maintainers of GNU Wget and GNU Wget2.

darnir
  • Thanks for the recommendation, is there any pre built binary for Ubuntu 16.04? – Adrian Feb 06 '18 at 21:22
  • For Ubuntu, I don't know. The Debian maintainers made a new package for Wget2 with the alpha release. But, compiling Wget is fairly easy, and requires no special steps. – darnir Feb 07 '18 at 12:44
  1. Please read the wget manual page / help output.

Logging and input file:

-i, --input-file=FILE download URLs found in local or external FILE.

  -o,  --output-file=FILE    log messages to FILE.
  -a,  --append-output=FILE  append messages to FILE.
  -d,  --debug               print lots of debugging information.
  -q,  --quiet               quiet (no output).
  -v,  --verbose             be verbose (this is the default).
  -nv, --no-verbose          turn off verboseness, without being quiet.
       --report-speed=TYPE   Output bandwidth as TYPE.  TYPE can be bits.
  -i,  --input-file=FILE     download URLs found in local or external FILE.
  -F,  --force-html          treat input file as HTML.
  -B,  --base=URL            resolves HTML input-file links (-i -F)
                             relative to URL.
       --config=FILE         Specify config file to use.

Download:

-nc, --no-clobber skip downloads that would download to existing files (overwriting them).

  -t,  --tries=NUMBER            set number of retries to NUMBER (0 unlimits).
       --retry-connrefused       retry even if connection is refused.
  -O,  --output-document=FILE    write documents to FILE.
  -nc, --no-clobber              skip downloads that would download to
                                 existing files (overwriting them).
  -c,  --continue                resume getting a partially-downloaded file.
       --progress=TYPE           select progress gauge type.
  -N,  --timestamping            don't re-retrieve files unless newer than
                                 local.
  --no-use-server-timestamps     don't set the local file's timestamp by
                                 the one on the server.
  -S,  --server-response         print server response.
       --spider                  don't download anything.
  -T,  --timeout=SECONDS         set all timeout values to SECONDS.
       --dns-timeout=SECS        set the DNS lookup timeout to SECS.
       --connect-timeout=SECS    set the connect timeout to SECS.
       --read-timeout=SECS       set the read timeout to SECS.
  -w,  --wait=SECONDS            wait SECONDS between retrievals.
       --waitretry=SECONDS       wait 1..SECONDS between retries of a retrieval.
       --random-wait             wait from 0.5*WAIT...1.5*WAIT secs between retrievals.
       --no-proxy                explicitly turn off proxy.
  -Q,  --quota=NUMBER            set retrieval quota to NUMBER.
       --bind-address=ADDRESS    bind to ADDRESS (hostname or IP) on local host.
       --limit-rate=RATE         limit download rate to RATE.
       --no-dns-cache            disable caching DNS lookups.
       --restrict-file-names=OS  restrict chars in file names to ones OS allows.
       --ignore-case             ignore case when matching files/directories.
  -4,  --inet4-only              connect only to IPv4 addresses.
  -6,  --inet6-only              connect only to IPv6 addresses.
       --prefer-family=FAMILY    connect first to addresses of specified family,
                                 one of IPv6, IPv4, or none.
       --user=USER               set both ftp and http user to USER.
       --password=PASS           set both ftp and http password to PASS.
       --ask-password            prompt for passwords.
       --no-iri                  turn off IRI support.
       --local-encoding=ENC      use ENC as the local encoding for IRIs.
       --remote-encoding=ENC     use ENC as the default remote encoding.
       --unlink                  remove file before clobber.       
  2. Follow how to wait for wget to finish to get more resources (a combined sketch is below).
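Putting those two points together, a rough, untested sketch (reusing the split files from the question):

#!/bin/bash
# let each wget read its chunk of URLs directly via -i, keep the
# terminal quiet with -q, and block until all downloads are finished
for f in ./temp/split*; do
    wget -nc -q -P ./images/ -i "$f" &
done
wait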
Ajay Kumar