58

We have actually burned static/archived copies of our ASP.NET websites for customers many times. Until now we have used WebZip, but we have had endless problems with crashes, downloaded pages not being re-linked correctly, etc.

We basically need an application that crawls and downloads static copies of everything on our ASP.NET website (pages, images, documents, CSS, etc.) and then processes the downloaded pages so that they can be browsed locally without an internet connection (getting rid of absolute URLs in links, etc.). The more idiot-proof, the better. This seems like a pretty common and (relatively) simple process, but I have tried a few other applications and have been really unimpressed.

Does anyone have archive software they would recommend? Does anyone have a really simple process they would share?

jskunkle
  • Check out https://archivebox.io: it's an open-source, self-hosted tool that creates a local, static, browsable HTML clone of websites (it saves HTML, JS, media files, PDFs, screenshots, static assets, and more). – Nick Sweeting Feb 01 '19 at 01:21

9 Answers

68

You could use wget:

wget -m -k -K -E http://url/of/web/site
chuckg
  • From the --help, I can see what the rest do, but what do the -K (capital) and -E flags do? – スーパーファミコン Feb 11 '09 at 21:34
  • Don't forget the -p switch to get images and other embedded objects, too. (-E converts pages to the .html extension; -K backs up the original file with an .orig extension.) – migu Feb 11 '09 at 21:49
  • The longer, but less cryptic version: `wget --mirror --convert-links --backup-converted --adjust-extension http://url/of/web/site` – jgillman Feb 26 '15 at 23:56
  • For me this just gets the index.html – Alper Oct 26 '15 at 15:36
  • Yes, for me too, it only retrieves index.html. And the Squarespace site I'm trying to retrieve locally keeps giving me error 429 "Too Many Requests". :( I've even set up rate limiting and a wait. – Michael R Nov 07 '15 at 20:20
  • It doesn't work if links are generated with JavaScript; wget doesn't support JavaScript. – jmp Nov 23 '15 at 23:08
  • For me, HTTrack worked a lot better. I archived a really old PHP page and all the img tags pointed to a PHP file with query params. HTTrack renamed them to .jpg / .png files and adjusted the img tags accordingly. – Felix Ebert Oct 23 '16 at 21:15
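Pulling the comment suggestions together, a fuller (and politer) wget invocation might look like the sketch below. The wait, rate, and user-agent values are arbitrary placeholders rather than anything from the answer; adjust them for the site you are mirroring.

# --page-requisites pulls images/CSS/JS, --no-parent keeps the crawl inside the start path,
# and the wait/rate flags help avoid 429 "Too Many Requests" responses.
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
     --wait=2 --random-wait --limit-rate=200k \
     -e robots=off --user-agent="Mozilla/5.0 (offline archive)" \
     http://url/of/web/site

If wget still fetches only index.html, robots.txt rules (which -e robots=off ignores) or JavaScript-generated links are the usual culprits; wget does not execute JavaScript, so such links will never be followed.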
43

On Windows, you can look at HTTrack. It's very configurable, allowing you to set the download speed, but you can also just point it at a website and run it with no configuration at all.

In my experience it's been a really good tool and works well. Some of the things I like about HTTrack are:

  • Open Source license
  • Resumes stopped downloads
  • Can update an existing archive
  • You can configure it to be non-aggressive when it downloads so it doesn't waste your bandwidth or the site's (an example command follows this list).
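The command-line build takes the same options as the GUI. A minimal sketch follows; the output directory, filter pattern, and rate value are illustrative assumptions, and the flags are quoted from memory, so check httrack --help before relying on them:

# -O sets the output directory, the "+..." scan rule keeps the crawl on the target domain,
# and --max-rate (bytes per second) is one way to keep the download non-aggressive.
httrack "http://www.example.com/" -O ./example-archive "+*.example.com/*" --max-rate=25000 -v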
Jesse Dearing
7

The Wayback Machine Downloader by hartator is simple and fast.

Install it as a Ruby gem, then run it with the desired domain and an optional Internet Archive timestamp.

sudo gem install wayback_machine_downloader
mkdir example
cd example
wayback_machine_downloader http://example.com --timestamp 19700101000000
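If you only need part of a site, the downloader also takes filter and output options. The flag names in this sketch are from memory of the project's README rather than from the answer, so confirm them with wayback_machine_downloader --help:

# --only restricts the download to URLs matching the filter;
# --directory overrides the default output path (./websites/<domain>, if memory serves).
wayback_machine_downloader http://example.com --only "/blog/" --directory ./example-snapshot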
2540625
4

I use Blue Crab on OSX and WebCopier on Windows.

Syntax
2

wget -r -k

... and investigate the rest of the options. I hope you've followed these guidelines: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html, so that all of your resources are safe to retrieve with GET requests.

Joel Hoffman
1

For OS X users, I've found that the SiteSucker application works well; the only thing to configure is how deep it follows links.

1

If your customers are archiving for compliance reasons, you want to ensure that the content can be authenticated. The options listed are fine for simple viewing, but they aren't legally admissible. In that case, you're looking for timestamps and digital signatures, which is much more complicated if you're doing it yourself. I'd suggest a service such as PageFreezer.
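If you do roll your own, a bare-minimum sketch of the idea (an illustration with placeholder paths, not legal advice, and not equivalent to what a hosted service provides) is to hash the archive and sign the manifest:

# Build a checksum manifest of the archived files and sign it.
find ./site-archive -type f -exec sha256sum {} + > manifest.sha256
gpg --armor --detach-sign manifest.sha256
# For legal purposes you would also want an RFC 3161 timestamp of the manifest
# from a trusted timestamping authority, which is where a dedicated service earns its keep.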

Dieghito
1

I just use: wget -m <url>.

Aram Verstegen
0

I've been using HTTrack for several years now. It handles all of the inter-page linking, etc. just fine. My only complaint is that I haven't found a good way to keep it limited to a sub-site. For instance, if there is a site www.foo.com/steve that I want to archive, it will likely follow links to www.foo.com/rowe and archive that too. Otherwise it's great: highly configurable and reliable.
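HTTrack's scan rules (the "+" and "-" URL filters) are the usual way to fence in a crawl like this. A sketch using the paths from this answer, untested against a real www.foo.com:

# "+" whitelists the sub-site, "-" blacklists the sibling path that keeps getting pulled in.
httrack "http://www.foo.com/steve/" -O ./foo-steve "+www.foo.com/steve/*" "-www.foo.com/rowe/*" -v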

Steve Rowe