Some Context
After fixing the code of a website to use a CDN (rewriting all the URLs to images, JavaScript & CSS), I need to test all the pages on the domain to make sure all the resources are fetched from the CDN.
All the site's pages are accessible through links; there are no isolated pages.
Question
Is there some automated way to give a domain name and request all pages + resources of the domain?
Answer:
OK, I found I can use wget, like so:
wget -p --no-cache -e robots=off -m -H -D cdn.domain.com,www.domain.com -o site1.log www.domain.com
Options explained:
-p - download page resources too (images, CSS, JavaScript, etc.)
--no-cache - get the real object, do not accept a server-cached copy
-e robots=off - disregard robots and no-follow directives
-m - mirror the site (follow links recursively)
-H - span hosts (follow links to other domains too)
-D cdn.domain.com,www.domain.com - specify which domains to follow, otherwise it will follow every link on the page
-o site1.log - log to the file site1.log
-U "Mozilla/5.0" - optional: fake the user agent; useful if the server returns different content to different browsers
www.domain.com - the site to download
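To then check that the static resources really came from the CDN, one option is to scan the log for resource URLs served from anywhere other than the CDN host. A minimal sketch, assuming GNU Wget's default log format and the example domains above (the extension list is just an illustration, adjust it to your site):

# list any static resources in the log that were NOT fetched from the CDN
grep -Eo 'https?://[^ ]+' site1.log \
  | grep -Ei '\.(png|jpe?g|gif|svg|css|js)(\?|$)' \
  | grep -v 'cdn\.domain\.com' \
  | sort -u

An empty result means every matching resource was fetched from the CDN.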
Enjoy!