24

I need to archive complete pages, including any linked images etc., on my Linux server, and I'm looking for the best solution. Is there a way to save all the assets and then relink them all so they work in the same directory?

I've thought about using curl, but I'm unsure how to do all of this. Also, might I need PHP-DOM?

Is there a way to use Firefox on the server and copy the temp files after the address has been loaded, or something similar?

Any and all input welcome.

Edit:

It seems as though wget is *not* going to work, as the files need to be rendered. I have Firefox installed on the server; is there a way to load the URL in Firefox, grab the temp files, and then clear them afterwards?

Tomas
  • Is all the content static, or is there dynamic content as well (PHP, JSP etc)? – thkala Jan 22 '11 at 17:46
  • This is part of a client web app so there could be anything. It would be best to even use javascript or java or similar to send the current browser state to the server and then do what else is needed. – Tomas Jan 22 '11 at 17:48
  • There are other alternatives in [get a browser rendered html+javascript](https://stackoverflow.com/q/18720218) – Lucas Cimon Jan 10 '14 at 19:32
  • The command `wget -p http://example.com` saves necessary pages and objects, but unfortunately it does not change paths. – SuB Sep 25 '16 at 16:24
  • I found this question very useful! – Student May 15 '19 at 14:17
  • Did you manage to find the best way? `wget` is clearly much better than `curl`, but I couldn't find any settings which I would call "best". Plus, in practice, there's just too much it misses (it can't download JavaScript-generated content, some pages still look broken, etc.), not to mention the downloaded pages are not compatible with future downloads from the same site, for instance. – cregox Nov 30 '20 at 11:11
  • Also, in this day and age, I would hope for the downloader to be able to watch for big images and videos, download arbitrary sizes of both and, above all, describe them with good sentences using AI for computer vision and crowd context (a cuddling cat video on YouTube could bring up the thumbnail with a description such as "cuddling yellow cat wakes up sneezing dog"), so as to **save the essence** of every page without taking too much space, for better longevity. – cregox Nov 30 '20 at 11:18

5 Answers

26

wget can do that, for example:

wget -r http://example.com/

This will mirror the whole example.com site.

Some interesting options are:

-Dexample.com: do not follow links of other domains
--html-extension: renames pages with text/html content-type to .html
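
Drawing on the comments below, a combined invocation for grabbing a single page one level deep, with its requisites and rewritten links, might look like this (example.com is a placeholder):

# Mirror one level deep, fetch page requisites, rewrite links for local
# viewing, stay on example.com, and rename text/html pages to .html
wget -r -l 1 -p -k -Dexample.com --html-extension http://example.com/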

Manual: http://www.gnu.org/software/wget/manual/

Arnaud Le Blanc
  • Guys/gals, wget is getting the complete site. I want to give it a single page and just get that page's content. Am I missing something here? – Tomas Jan 22 '11 at 18:34
  • use `-l 1`; it will limit the mirroring to 1 level – Arnaud Le Blanc Jan 22 '11 at 19:41
  • `wget -m` which is currently equivalent to `-r -N -l inf --no-remove-listing` – mb21 Jul 11 '14 at 12:54
  • `--html-extension` will be deprecated from version 1.12 on and `--adjust-extension` should be used. _As of version 1.12, Wget will also ensure that any downloaded files of type text/css end in the suffix .css, and the option was renamed from --html-extension, to better reflect its new behavior. The old option name is still acceptable, but should now be considered deprecated._ – dennis Feb 09 '18 at 07:54
  • wget has become totally unusable for this. – Lothar Jul 27 '18 at 16:19
15

Use the following command:

wget -E -k -p http://yoursite.com

Use -E to adjust extensions, -k to convert links so that the page loads from your local copy, and -p to download all objects (page requisites) inside the page.

Please note that this command does not download other pages hyperlinked from the specified page; it only downloads the objects required to load the specified page properly.
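
If some of the page's images live on other domains, as discussed in the comments below, adding --span-hosts should pick those up as well; a rough sketch (the URL is a placeholder):

# -H / --span-hosts lets the page-requisite download follow assets
# hosted on other domains; -E, -k and -p behave as described above
wget -E -k -p -H http://yoursite.com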

SuB
  • I tried to backup some blogs I like, but the images are usually hosted outside the domain I tried to get. For example, in "theblogilike.com" lots of photos are stored in "someotherdomain.com/photos".. Is it possible to get around this? – Student May 15 '19 at 14:46
  • I found a solution! By adding the option "--span-hosts", wget will fetch data from the domains linked by the specified domain. – Student May 16 '19 at 02:21
  • Thanks. Is there any way to limit depth (as in `-l 1`) only for external hosts? – Evan Jo Jan 25 '23 at 18:51
6

If all the content in the web page was static, you could get around this issue with something like wget:

$ wget -r -l 10 -p http://my.web.page.com/

or some variation thereof.

Since you also have dynamic pages, you cannot in general archive such a web page using wget or any simple HTTP client. A proper archive needs to incorporate the contents of the backend database and any server-side scripts. That means that the only way to do this properly is to copy the backing server-side files. That includes at least the HTTP server document root and any database files.

EDIT:

As a workaround, you could modify your web page so that a suitably privileged user could download all the server-side files, as well as a text-mode dump of the backing database (e.g. an SQL dump). You should take extreme care to avoid opening any security holes through this archiving system.
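
As a very rough sketch of that kind of server-side backup (the paths, database name, credentials, and the use of MySQL are assumptions, not details from this answer):

# Archive the document root and take a text-mode dump of the database.
# /var/www/html, the backup paths, user and database name are placeholders;
# -p makes mysqldump prompt for the password interactively.
STAMP=$(date +%Y%m%d)
tar czf "/backups/docroot-$STAMP.tar.gz" /var/www/html
mysqldump --single-transaction -u backup_user -p mydatabase > "/backups/db-$STAMP.sql"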

If you are using a virtual hosting provider, most of them provide some kind of Web interface that allows backing-up the whole site. If you use an actual server, there is a large number of back-up solutions that you could install, including a few Web-based ones for hosted sites.

thkala
4

What's the best way to save a complete webpage on a linux server?

I tried a couple of tools, curl and wget included, but nothing worked up to my expectations.

Finally I found a tool that saves a complete webpage (images, scripts, linked pages... everything included). It's written in Rust and is called monolith. Take a look.

It does not save images and other scripts/stylesheets as separate files, but packs them all into one HTML file.

For example, if I had to save https://nodejs.org/en/docs/es6 to es6.html with all page requisites packed into one file, then I had to run:

monolith https://nodejs.org/en/docs/es6 -o es6.html
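
If monolith is not already available on the server, it can be built through Rust's package manager, assuming a Rust toolchain is installed:

# Install monolith via cargo (requires a Rust toolchain)
cargo install monolith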
Parampreet Rai
2
wget -r http://yoursite.com

Should be sufficient and grab images/media. There are plenty of options you can feed it.

Note: I believe neither wget nor any other program supports downloading images specified through CSS, so you may need to do that yourself manually.
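
As the comment below points out, recent wget versions do fetch CSS-referenced images when `-p` is used. If you ever do have to pull them manually, a rough sketch might look like this (style.css and the base URL are placeholders, and the relative-URL handling is deliberately naive):

# Extract url(...) references from a stylesheet and fetch each one.
BASE="http://yoursite.com"
grep -oE 'url\([^)]+\)' style.css | sed -e 's/^url(//' -e 's/)$//' -e "s/[\"']//g" | while read -r asset; do
  case "$asset" in
    http*) wget -nc "$asset" ;;        # absolute URL
    *)     wget -nc "$BASE/$asset" ;;  # treat as relative to the site root
  esac
done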

Here may be some useful arguments: http://www.linuxjournal.com/content/downloading-entire-web-site-wget

meder omuraliev
  • `wget` downloads any image referenced in HTML or CSS when it is used with the `-p` switch. – SuB Apr 03 '17 at 10:56