64

I'm trying to use the wget command:

wget -p http://www.example.com 

to fetch all the files on the main page. For some websites it works, but in most cases it only downloads index.html. I've tried the wget -r command, but it doesn't work. Does anyone know how to fetch all the files on a page, or just give me a list of files and their corresponding URLs on the page?

jor
Jay H
  • How does this differ from your [previous question](http://stackoverflow.com/questions/11123477/how-to-get-a-list-of-all-paths-files-on-a-webpage-using-wget-or-curl-in-php)? If it's the same problem, edit your old question to clarify it. – Emil Vikström Jun 20 '12 at 17:07
  • Possible duplicate of [how to get a list of all paths/files on a webpage using wget or curl in php?](https://stackoverflow.com/questions/11123477/how-to-get-a-list-of-all-paths-files-on-a-webpage-using-wget-or-curl-in-php) – H H Sep 01 '17 at 09:32

8 Answers

108

Wget is also able to download an entire website. But because this can put a heavy load on the server, wget obeys the robots.txt file.

wget -r -p http://www.example.com

The -p parameter tells wget to include all files, including images. This means that all of the HTML files will look the way they should.

So what if you don't want wget to obey the robots.txt file? You can simply add -e robots=off to the command, like this:

wget -r -p -e robots=off http://www.example.com

Many sites will not let you download the entire site and will check your browser's identity. To get around this, use -U mozilla to fake the user agent:

wget -r -p -e robots=off -U mozilla http://www.example.com

A lot of website owners will not like the fact that you are downloading their entire site. If the server sees that you are downloading a large number of files, it may automatically add you to its blacklist. The way around this is to wait a few seconds after every download. The way to do this with wget is by including --wait=X (where X is the number of seconds).
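
For example, to pause two seconds between requests (a sketch; the wait value here is arbitrary):

wget --wait=2 -r -p -e robots=off -U mozilla http://www.example.com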

You can also use the --random-wait parameter to let wget choose a random number of seconds to wait. To include this in the command:

wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
Benjamin Loison
Ritesh Chandora
  • Thanks for your answer. I tried the 3 ways you mentioned on some common URLs (amazon.com, for example), but all I got was index.html. Do you have any other suggestions? – Jay H Jun 20 '12 at 17:28
  • Same here, only index.html. – BigSack Mar 24 '13 at 07:55
  • @JayH Try not to use an address that will be redirected. I.e., if you use ```http://amazon.com``` it will not work, because you'll be redirected to www.amazon.com; but if you use ```http://www.amazon.com``` it will start to download the whole site. The ability to ignore the robots file is not very "polite", so it will not work as well as you might imagine. – Stefano Falsetto Aug 24 '14 at 22:47
  • I hate how the most valued answer is at the bottom of the page – user4757174 Apr 14 '17 at 16:17
  • In addition to `--random-wait`, the parameter `-w X` can also be used, where `X` is a time in seconds used as the base value for calculating the random wait times. – S.I. Nov 03 '17 at 13:14
43

Firstly, to clarify the question, the aim is to download index.html plus all the requisite parts of that page (images, etc). The -p option is equivalent to --page-requisites.

The reason the page requisites are not always downloaded is that they are often hosted on a different domain from the original page (a CDN, for example). By default, wget refuses to visit other hosts, so you need to enable host spanning with the --span-hosts option.

wget --page-requisites --span-hosts 'http://www.amazon.com/'

If you need to be able to load index.html and have all the page requisites load from the local version, you'll need to add the --convert-links option, so that URLs in img src attributes (for example) are rewritten to relative URLs pointing to the local versions.
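
For example, building on the command above (a sketch adding --convert-links):

wget --page-requisites --span-hosts --convert-links 'http://www.amazon.com/'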

Optionally, you might also want to save all the files under a single "host" directory by adding the --no-host-directories option, or save all the files in a single, flat directory by adding the --no-directories option.

Using --no-directories will result in lots of files being downloaded to the current directory, so you probably want to specify a folder name for the output files, using --directory-prefix.

wget --page-requisites --span-hosts --convert-links --no-directories --directory-prefix=output 'http://www.amazon.com/'
Alf Eaton
  • Thanks for the valuable answer. Can you please add some extra info to make it more general before I award you the bounty? For example, under `http://indiabix.com/civil-engineering/questions-and-answers/`, I want wget to visit each category/chapter and download all the images from every page in every section (on the left sidebar). Note that by images I mean all images, including the images of the math formulae involved in the questions. *The problem is that the download stops after downloading index.html. A working example for this case would be great!* – Naveen Aug 23 '14 at 19:00
  • @InsaneCoder You might want to start a separate question for that and show what you've tried, as recursive fetching is a whole other set of problems, and (as I understand it) isn't what the original question was asking about. – Alf Eaton Aug 26 '14 at 07:24
  • @InsaneCoder Adding the `--mirror` option is the most straightforward, and might be enough for your needs. – Alf Eaton Aug 26 '14 at 07:36
  • Be careful to use `--span-hosts`, add `-D` to limit spanning to certain domains. – Evan Hu Sep 15 '16 at 12:02
  • @EvanHu Adding a whitelist of domains wouldn't help here, as wget needs to be able to fetch the page requisites wherever they're hosted. – Alf Eaton Sep 20 '16 at 15:57
  • @AlfEaton thanks for your concern. Can you try `wget -rkEpHN -e robots=off -U mozilla http://www.yinwang.org/` and `wget -rkEpHN -Dyinwang.org -e robots=off -U mozilla http://www.yinwang.org/` and check the results? – Evan Hu Sep 22 '16 at 13:12
  • @EvanHu Those commands are using the `-r` (recursive) flag, so are not relevant to this question/answer. – Alf Eaton Sep 26 '16 at 10:18
  • `--span-hosts` along with `--domains=` saved me. I had a website with images on a static subdomain, so wget couldn't retrieve them. – vladkras Nov 02 '16 at 14:36
8

The link you have provided is the homepage (i.e. /index.html), so it's clear that you are getting only the index.html page. For an actual file download, for example a "test.zip" file, you need to add the exact file name at the end. For example, use the following to download the test.zip file:

wget -p domainname.com/test.zip

Download a Full Website Using wget --mirror

The following is the command to execute when you want to download a full website and make it available for local viewing:

wget --mirror -p --convert-links -P ./LOCAL-DIR http://www.example.com

  • --mirror: turn on options suitable for mirroring.

  • -p: download all files that are necessary to properly display a given HTML page.

  • --convert-links: after the download, convert the links in the documents for local viewing.

  • -P ./LOCAL-DIR: save all the files and directories to the specified directory.

Download Only Certain File Types Using wget -r -A

You can use this in the following situations:

  • Download all images from a website,

  • Download all videos from a website,

  • Download all PDF files from a website

wget -r -A.pdf http://example.com/
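
Similarly, a sketch for downloading all images (the suffix list here is just an example):

wget -r -A jpg,jpeg,png,gif http://example.com/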

Peter O.
Athul AK
5

Another problem might be that the site you're mirroring uses links without www. So if you specify

wget -p -r http://www.example.com

it won't download any linked (internal) pages, because they are from a "different" domain. If this is the case, then use

wget -p -r http://example.com

instead (without www).

jor
5

I had the same problem downloading files of the CFSv2 model. I solved it by mixing the above answers and adding the parameter --no-check-certificate:

wget -nH --cut-dirs=2 -p -e robots=off --random-wait -c -r -l 1 -A "flxf*.grb2" -U Mozilla --no-check-certificate https://nomads.ncdc.noaa.gov/modeldata/cfsv2_forecast_6-hourly_9mon_flxf/2018/201801/20180101/2018010100/

Here is a brief explanation of every parameter used; for further details, see the GNU Wget manual.

  • -nH equivalent to --no-host-directories: Disable generation of host-prefixed directories. In this case, it avoids the generation of the directory ./nomads.ncdc.noaa.gov/

  • --cut-dirs=<number>: Ignore directory components. In this case, avoid the generation of the directories ./modeldata/cfsv2_forecast_6-hourly_9mon_flxf/

  • -p equivalent to --page-requisites: This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.

  • -e robots=off: ignore the restrictions in the robots.txt file

  • --random-wait: Causes the time between requests to vary between 0.5 and 1.5 times wait seconds, where wait was specified using the --wait option.

  • -c equivalent to --continue: continue getting a partially-downloaded file.

  • -r equivalent to --recursive: Turn on recursive retrieving. The default maximum depth is 5

  • -l <depth> equivalent to --level <depth>: Specify recursion maximum depth level

  • -A <acclist> equivalent to --accept <acclist>: specify a comma-separated list of the name suffixes or patterns to accept.

  • -U <agent-string> equivalent to --user-agent=<agent-string>: The HTTP protocol allows the clients to identify themselves using a User-Agent header field. This enables distinguishing the WWW software, usually for statistical purposes or for tracing of protocol violations. Wget normally identifies as ‘Wget/version’, the version being the current version number of Wget.

  • --no-check-certificate: Don't check the server certificate against the available certificate authorities.

cmcuervol
3

I know that this thread is old, but try what Ritesh mentioned, adding:

--no-cookies

It worked for me!
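
For example, combined with the command from Ritesh's answer (a sketch; the URL is a placeholder):

wget -r -p -e robots=off -U mozilla --no-cookies http://www.example.com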

Joshua
1

If you look for index.html in the wget manual, you can find the option --default-page=name, which is index.html by default. You can change it to index.php, for example:

--default-page=index.php
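
For example (a sketch; the URL is a placeholder, and you would combine it with whichever other flags you are already using):

wget -r -p --default-page=index.php http://www.example.com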
adrianTNT
1

If you only get the index.html and that file looks like it only contains binary data (i.e. no readable text, only control characters), then the site is probably sending the data using gzip compression.

You can confirm this by running cat index.html | gunzip to see if it outputs readable HTML.

If this is the case, then wget's recursive feature (-r) won't work. There is a patch for wget to work with gzip compressed data, but it doesn't seem to be in the standard release yet.
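
For example, a quick way to check and decompress the file (a sketch using standard tools; the output file name is just an example):

file index.html                                # reports "gzip compressed data" if the response was compressed
cat index.html | gunzip > index.decoded.html   # decompress it into a readable copy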

Silveri