247

I'm using wget to download website content, but wget downloads the files one by one.

How can I make wget download using 4 simultaneous connections?

jubo
  • 6
    A similar question with a nice solution: http://stackoverflow.com/questions/7577615/parallel-wget-in-bash – JohnEye Jun 03 '14 at 00:10
  • 2
    Have a look at this project https://github.com/rockdaboot/wget2 – user9869932 Sep 22 '16 at 23:59
  • 3
    For those seeing the above comment regarding Wget2, please use the new repository location: https://gitlab.com/gnuwget/wget2. It is the official location for GNU Wget2, the spiritual successor to Wget – darnir Feb 05 '21 at 08:48

19 Answers

214

Use aria2:

aria2c -x 16 [url]
#          |
#          |
#          |
#          ----> the number of connections 

http://aria2.sourceforge.net

gmarian
  • 30
    I don't see how this helps download a website - it looks like it only downloads 1 file. If this is true - the votes should be -ve. – Stephen Nov 10 '13 at 22:42
  • 9
    I agree, this is not a good answer, because aria2 cannot do web or ftp mirroring like wget or lftp. lftp does mirroring as well as supporting multiple connections. – Anachronist Jan 11 '14 at 02:42
  • 10
    Don't forget `-s` to specify the number of splits, and `-k` to specify the minimum size per split segment - otherwise you might never reach the `-x` max connections. – Bob Mar 11 '14 at 13:16
  • 3
    @Stephen this is to download very large files much faster *from* websites by using multiple sockets to the server instead of one. This is not meant for scraping a website. – gabeio Feb 04 '15 at 22:10
  • does not support socks* – Fedcomp May 16 '16 at 20:53
  • aria2c is a multi-protocol download manager; pretty weird that this answer has 155 upvotes. See my answer below for the best tool out there to do the job. – pouya Aug 14 '17 at 21:40
  • FWIW no combination of any args listed in this answer and below get me any more than 1 connection to download one large file. ...? – Cory Mawhorter Sep 19 '18 at 01:14
  • Just a note -- the option for parallel downloads is now `-j`, like with `make`. – StarDust Jan 11 '22 at 04:38
  • The author EXPLICITLY SAID "USING WGET". – Szczepan Hołyszewski Nov 29 '22 at 18:39
130

Wget does not support multiple socket connections in order to speed up download of files.

I think we can do a bit better than gmarian's answer.

The correct way is to use aria2.

aria2c -x 16 -s 16 [url]
#          |    |
#          |    |
#          |    |
#          ---------> the number of connections here

Official documentation:

-x, --max-connection-per-server=NUM: The maximum number of connections to one server for each download. Possible Values: 1-16 Default: 1

-s, --split=N: Download a file using N connections. If more than N URIs are given, first N URIs are used and remaining URLs are used for backup. If less than N URIs are given, those URLs are used more than once so that N connections total are made simultaneously. The number of connections to the same host is restricted by the --max-connection-per-server option. See also the --min-split-size option. Possible Values: 1-* Default: 5
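
Note (from the comments below): since the 1.10 release, aria2 defaults to one connection per host and a 20 MiB minimum segment size, so on newer versions you may also need to lower -k / --min-split-size for the splits to actually happen, for example:

aria2c -x 16 -s 16 -k 1M [url]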

thomas.han
  • 25
    To document `-x, --max-connection-per-server=NUM The maximum number of connections to one server for each download. Possible Values: 1-16 Default: 1` and `-s, --split=N Download a file using N connections. If more than N URIs are given, first N URIs are used and remaining URLs are used for backup. If less than N URIs are given, those URLs are used more than once so that N connections total are made simultaneously. The number of connections to the same host is restricted by the --max-connection-per-server option. See also the --min-split-size option. Possible Values: 1-* Default: 5` – Nick Apr 07 '16 at 05:57
  • 1
    Thanks for elaborating on the parameters, Nick. – thomas.han Apr 07 '16 at 11:10
  • 8
    The option -s alone no longer splits a file from a single server since the 1.10 release. One needs to use --max-connection-per-server together with it to establish multiple connections. See the aria2 documentation: `About the number of connections Since 1.10.0 release, aria2 uses 1 connection per host by default and has 20MiB segment size restriction. So whatever value you specify using -s option, it uses 1 connection per host. To make it behave like 1.9.x, use --max-connection-per-server=4 --min-split-size=1M.` – Samuel Li Sep 09 '16 at 05:04
  • 2
    The shorthand of @SamuelLi's update is `aria2c -x 4 -k 1M url` and worked well for me (a server with a limit of 100k per connection let me download at 400k with said parameters) – EkriirkE Nov 22 '18 at 18:37
  • 3
    Critically, `aria2` does *not* support recursive HTTP downloads, making it a substandard replacement for `wget` if `-r` is desired. – user2943160 Jan 09 '20 at 01:40
  • @user2943160 good point. plus OP asks about wget. – avia Sep 17 '21 at 09:56
74

Since GNU parallel was not mentioned yet, let me give another way:

cat url.list | parallel -j 8 wget -O {#}.html {}
Nikolay Shmyrev
44

I found (probably) a solution

In the process of downloading a few thousand log files from one server to the next I suddenly had the need to do some serious multithreaded downloading in BSD, preferably with Wget as that was the simplest way I could think of handling this. A little looking around led me to this little nugget:

wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url]

Just repeat the wget -r -np -N [url] for as many processes as you need... Granted, this isn't pretty and there are surely better ways to do it, but if you want something quick and dirty it should do the trick...

Note: the option -N makes wget download only "newer" files, which means it won't overwrite or re-download files unless their timestamp changes on the server.
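
A loop form of the same trick (a rough sketch; -nc is used as suggested in the comments below so each process skips files another one has already started, and wget refuses to combine -N with -nc):

for i in 1 2 3 4; do
    wget -r -np -nc [url] &   # -nc: skip files another process already created
done
wait                          # block until all background downloads finish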

SMUsamaShah
  • 12
    But doesn't that download the whole set of artifacts for each process? – Kai Mattern Feb 17 '14 at 12:15
  • 17
    @KaiMattern: add the `-nc` option: "no clobber" - it causes wget to ignore already downloaded (even partially) files. – SF. May 12 '15 at 22:20
  • 2
    I had a list of images I needed to download, and this worked for me as well: `wget -i list.txt -nc & wget -i list.txt -nc & wget -i list.txt -nc` Very ugly, but hey, it works. :P – Jared Dunham Sep 22 '16 at 20:50
  • 3
    If one of those connections breaks for some reason, you get incomplete files that the other connections won't touch. This method creates integrity issues. – muhammedv Mar 06 '17 at 10:59
  • 2
    The `-b` flag will run the wget process in the background, as an alternative to bash's `&` job control built-in. STDOUT will be written to wget-log if `-o ` is not specified. Good for scripting. See wget(1) for more details. – Paul Dec 29 '17 at 18:22
32

A new (but not yet released) tool is Mget. It already has many options known from Wget and comes with a library that allows you to easily embed (recursive) downloading into your own application.

To answer your question:

mget --num-threads=4 [url]

UPDATE

Mget is now developed as Wget2 with many bugs fixed and more features (e.g. HTTP/2 support).

--num-threads is now --max-threads.
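
So the Wget2 equivalent of the example above should be something like:

wget2 --max-threads=4 [url]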

rockdaboot
  • 2
    Nice find. Thank you! – user9869932 Sep 22 '16 at 23:54
  • 1
    any tips on how to install wget2 on a mac? Site only documents how to install it from source and having trouble getting autopoint – Chris Jan 23 '18 at 22:49
  • 1
    In our TravisCI script we use homebrew to install gettext (which includes autopoint). Have a look at .travis_setup.sh from the wget2 repo. – rockdaboot Jan 29 '18 at 14:01
  • 2
    Great! I like how this did recursive downloads, and worked with my existing `wget` command. If you have difficulty compiling wget2, an alternative might be to use [a docker image](https://www.google.de/search?q=docker+wget2). – joeytwiddle Sep 04 '20 at 11:38
29

Another program that can do this is axel.

axel -n <NUMBER_OF_CONNECTIONS> URL

For basic HTTP auth,

axel -n <NUMBER_OF_CONNECTIONS> "user:password@https://domain.tld/path/file.ext"

Ubuntu man page.

Lord Loh.
17

I strongly suggest using httrack.

ex: httrack -v -w http://example.com/

It will build a mirror with 8 simultaneous connections by default. Httrack has tons of options to play with. Have a look.

rr-
  • @aaa90210: It'd be great if you had succinctly explained the program's deficiencies. ArturBodera's comment is much more informative. – Richard Apr 29 '16 at 17:44
  • 1
    @ArturBodera You can add a cookies.txt file to the folder you are running your program from and it will automatically add those cookies to the download header. – Bertoncelj1 Jun 02 '18 at 13:11
  • httrack does not support following redirects – Chris Hunt Oct 17 '18 at 02:30
11

As other posters have mentioned, I'd suggest you have a look at aria2. From the Ubuntu man page for version 1.16.1:

aria2 is a utility for downloading files. The supported protocols are HTTP(S), FTP, BitTorrent, and Metalink. aria2 can download a file from multiple sources/protocols and tries to utilize your maximum download bandwidth. It supports downloading a file from HTTP(S)/FTP and BitTorrent at the same time, while the data downloaded from HTTP(S)/FTP is uploaded to the BitTorrent swarm. Using Metalink's chunk checksums, aria2 automatically validates chunks of data while downloading a file like BitTorrent.

You can use the -x flag to specify the maximum number of connections per server (default: 1):

aria2c -x 16 [url] 

If the same file is available from multiple locations, you can choose to download from all of them. Use the -j flag to specify the maximum number of parallel downloads for every static URI (default: 5).

aria2c -j 5 [url] [url2]

Have a look at http://aria2.sourceforge.net/ for more information. For usage information, the man page is really descriptive and has a section on the bottom with usage examples. An online version can be found at http://aria2.sourceforge.net/manual/en/html/README.html.

runejuhl
7

wget can't download over multiple connections; instead, you can try other programs like aria2.

user181677
5

try pcurl

http://sourceforge.net/projects/pcurl/

It uses curl instead of wget and downloads in 10 segments in parallel.

Rumble
5

use

aria2c -x 10 -i websites.txt >/dev/null 2>/dev/null &

In websites.txt, put one URL per line, for example:

https://www.example.com/1.mp4
https://www.example.com/2.mp4
https://www.example.com/3.mp4
https://www.example.com/4.mp4
https://www.example.com/5.mp4
David Corp
4

They always say it depends, but when it comes to mirroring a website, the best tool out there is httrack. It is super fast and easy to work with. The only downside is its so-called support forum, but you can find your way using the official documentation. It has both GUI and CLI interfaces and it supports cookies; just read the docs. This is the best. (Be careful with this tool, you can download the whole web onto your hard drive.)

httrack -c8 [url]

By default, the maximum number of simultaneous connections is limited to 8 to avoid server overload.

pouya
4

Use xargs to make wget work on multiple files in parallel

#!/bin/bash

mywget()
{
    wget "$1"
}

export -f mywget

# run wget in parallel using 8 thread/connection
xargs -P 8 -n 1 -I {} bash -c "mywget '{}'" < list_urls.txt

aria2 options: the right way to work with files smaller than 20 MB

aria2c -k 2M -x 10 -s 10 [url]

-k 2M splits the file into 2 MB chunks.

-k or --min-split-size has a default value of 20 MB; if you don't set this option, files under 20 MB will be downloaded over a single connection no matter what value you give -x or -s.

ewwink
4

You can use xargs

-P is the number of processes. For example, if you set -P 4, four links will be downloaded at the same time; if you set -P 0, xargs will launch as many processes as possible and all of the links will be downloaded.

cat links.txt | xargs -P 4 -I{} wget {}
mirhossein
4

I'm using GNU parallel

cat listoflinks.txt | parallel --bar -j ${MAX_PARALLEL:-$(nproc)} wget -nv {}
  1. cat will pipe a list of line-separated URLs to parallel
  2. the --bar flag will show the parallel execution progress bar
  3. the MAX_PARALLEL env var sets the maximum number of parallel downloads; use it carefully, the default here is the current number of CPUs

Tip: use --dry-run to see what will happen if you execute the command.
cat listoflinks.txt | parallel --dry-run --bar -j ${MAX_PARALLEL} wget -nv {}

Pratik Balar
3

make can be parallelised easily (e.g., make -j 4). For example, here's a simple Makefile I'm using to download files in parallel using wget:

BASE=http://www.somewhere.com/path/to
FILES=$(shell awk '{printf "%s.ext\n", $$1}' filelist.txt)
LOG=download.log

all: $(FILES)
    echo $(FILES)

%.ext:
    wget -N -a $(LOG) $(BASE)/$@

.PHONY: all
default: all
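
You would then run it with the desired number of parallel jobs, e.g.:

make -j 4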
Paul Price
2

Consider using regular expressions or FTP globbing. That way you could start wget multiple times with different groups of filename starting characters, depending on their frequency of occurrence.

This is for example how I sync a folder between two NAS:

wget --recursive --level 0 --no-host-directories --cut-dirs=2 --no-verbose --timestamping --backups=0 --bind-address=10.0.0.10 --user=<ftp_user> --password=<ftp_password> "ftp://10.0.0.100/foo/bar/[0-9a-hA-H]*" --directory-prefix=/volume1/foo &
wget --recursive --level 0 --no-host-directories --cut-dirs=2 --no-verbose --timestamping --backups=0 --bind-address=10.0.0.11 --user=<ftp_user> --password=<ftp_password> "ftp://10.0.0.100/foo/bar/[!0-9a-hA-H]*" --directory-prefix=/volume1/foo &

The first wget syncs all files/folders starting with 0, 1, 2... F, G, H and the second thread syncs everything else.

This was the easiest way to sync between a NAS with one 10G ethernet port (10.0.0.100) and a NAS with two 1G ethernet ports (10.0.0.10 and 10.0.0.11). I bound the two wget processes through --bind-address to the different ethernet ports and ran them in parallel by putting & at the end of each line. That way I was able to copy huge files at 2x 100 MB/s = 200 MB/s in total.

mgutt
2

Call wget for each link and set it to run in the background.

I tried this Python code

import subprocess                      # used instead of the IPython "!" shell syntax

with open('links.txt', 'r') as f1:     # Open links.txt file with read mode
  list_1 = f1.read().splitlines()      # Get every line in links.txt

for i in list_1:                       # Iterate over each link
  subprocess.run(['wget', '-bq', i])   # Call wget with background mode

Parameters :

      b - Run in Background
      q - Quiet mode (No Output)
Everest Ok
0

If you are doing recursive downloads, where you don't know all of the URLs yet, wget is perfect.

If you already have a list of each URL you want to download, then skip down to cURL below.

Multiple Simultaneous Downloads Using Wget Recursively (unknown list of URLs)

# Multiple simultaneous downloads

URL=ftp://ftp.example.com

for i in {1..10}; do
    wget --no-clobber --recursive "${URL}" &
done

The above loop will start 10 wget processes, each recursively downloading from the same website; however, they will not overlap or download the same file twice.

Using --no-clobber prevents each of the 10 wget processes from downloading the same file twice (including full relative URL path).

& forks each wget to the background, allowing you to run multiple simultaneous downloads from the same website using wget.

Multiple Simultaneous Downloads Using curl from a list of URLs

If you already have a list of URLs you want to download, curl -Z is parallelised curl, with a default of 50 downloads running at once.

However, for curl, the list has to be in this format:

url = https://example.com/1.html
-O
url = https://example.com/2.html
-O

So if you already have a list of URLs to download, simply format the list, and then run cURL

cat url_list.txt
#https://example.com/1.html
#https://example.com/2.html

touch url_list_formatted.txt

while read -r URL; do
    echo "url = ${URL}" >> url_list_formatted.txt
    echo "-O" >> url_list_formatted.txt
done < url_list.txt

Download in parallel using curl from list of URLs:

curl -Z --parallel-max 100 -K url_list_formatted.txt

For example,

$ curl -Z --parallel-max 100 -K url_list_formatted.txt
DL% UL%  Dled  Uled  Xfers  Live   Qd Total     Current  Left    Speed
100 --   2512     0     2     0     0  0:00:01  0:00:01 --:--:--  1973

$ ls
1.html  2.html  url_list_formatted.txt  url_list.txt
sickcodes