9

I would like to save a web page programmatically.

I don't mean merely saving the HTML. I would also like to automatically store all associated files (images, CSS files, maybe embedded SWF, etc.) and, ideally, rewrite the links for local browsing.

The intended usage is a personal bookmarks application, in which link content is cached in case the original copy is taken down.

Cœur
Joseph Turian

3 Answers

7

Take a look at wget, specifically the -p flag

-p  --page-requisites
This option causes Wget to download all the files
that are necessary to properly display
a given HTML page. This includes such
things as inlined images, sounds, and
referenced stylesheets.

The following command:

wget -p http://<site>/1.html

will download 1.html and all the files it requires.
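
Since the question also asks for link rewriting, wget's -k (--convert-links) option is worth combining with -p. Below is a minimal sketch of driving wget from Python, assuming wget is installed and on the PATH. The function names and the destination directory are hypothetical, and --adjust-extension assumes a reasonably recent wget (older releases call it --html-extension).

```python
import subprocess


def build_wget_command(url, dest_dir):
    """Build a wget invocation that saves a page for offline viewing."""
    return [
        "wget",
        "--page-requisites",   # -p: fetch images, CSS, etc. needed to render the page
        "--convert-links",     # -k: rewrite links in the saved files for local browsing
        "--adjust-extension",  # -E: add .html/.css suffixes where they are missing
        "--span-hosts",        # -H: also fetch requisites hosted on other domains
        "--directory-prefix=" + dest_dir,  # -P: directory to store the files in
        url,
    ]


def save_page(url, dest_dir):
    """Run wget on the URL; return True if wget exited with status 0."""
    result = subprocess.run(build_wget_command(url, dest_dir))
    return result.returncode == 0
```

Calling save_page("http://<site>/1.html", "cache") would then leave the page and its requisites under cache/. Note that wget can exit with a non-zero status when a single requisite fails to download, even if the page itself was saved.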

Josh
  • And why did someone downvote me? I mean the -1 doesn't bother me so much as I'd like to correct any issues there might be with my answer... – Josh Nov 14 '09 at 14:26
  • This looks pretty good, except sometimes the output doesn't look the same as the page that I copied. For example, I tried to 'wget -p' http://ffffound.com/image/3d3795b5447291980a40f3719dea4b5b15ff3ec9 However, the related images which are laid out as a horizontal list, now become a long vertical list, one-per-line. Why? – Joseph Turian Nov 16 '09 at 07:18
2

On Windows, you can run IE as a COM object and pull everything out.

Another option is to start from the Mozilla source code.

In Java, there is Lobo.

Or use commons-httpclient and write a lot of code.
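
To give a feel for the "write a lot of code" route, here is a rough sketch of just the first step, finding the files a page references, using only the Python standard library rather than commons-httpclient. The names ASSET_ATTRS, AssetCollector and find_assets are made up for this illustration, and the list of tags is an assumption, not a complete one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# (tag, attribute) pairs that commonly point at files a page needs.
ASSET_ATTRS = {
    ("img", "src"),
    ("script", "src"),
    ("embed", "src"),
    ("link", "href"),
}


class AssetCollector(HTMLParser):
    """Collect absolute URLs of resources referenced by an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            # Only keep <link> elements that load something to display,
            # not e.g. rel="canonical" or rel="alternate".
            rel = (dict(attrs).get("rel") or "").lower().split()
            if "stylesheet" not in rel and "icon" not in rel:
                return
        for name, value in attrs:
            if value and (tag, name) in ASSET_ATTRS:
                url = urljoin(self.base_url, value)  # resolve relative URLs
                if url not in self.assets:
                    self.assets.append(url)


def find_assets(html, base_url):
    """Return the de-duplicated asset URLs in document order."""
    parser = AssetCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.assets
```

This only looks at the HTML itself. Resources referenced from inside stylesheets (background images, CSS imports) are not found, and downloading the files and rewriting the links would still have to be written on top of it.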

bmargulies
  • +1 if you need stuff like background images referenced in stylesheets and CSS imports, you need a real-world HTML and CSS parser. That's half a browser there already, so you might as well just do it with a real browser. Easiest to embed IE, or work as a Firefox extension. – bobince Nov 13 '09 at 22:41
0

You could try the MHTML format (which is what IE uses). http://en.wikipedia.org/wiki/MHTML

In other words, you'd download each object (image, CSS, etc.) to your computer and then embed them all, Base64-encoded, in a single file.
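
As a rough sketch of that idea: MHTML is a MIME multipart/related message, so Python's standard email package can assemble one. This assumes the page and its resources have already been downloaded. The function name build_mhtml and the shape of the resources argument are hypothetical, and whether a particular browser opens the result has not been verified here.

```python
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(page_url, html, resources):
    """Pack a page and its resources into one MIME multipart/related archive.

    resources maps each absolute resource URL to a (mime_type, raw_bytes) pair.
    """
    archive = MIMEMultipart("related", type="text/html")
    archive["Subject"] = page_url

    # The HTML document is the first (root) part of the archive.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Every other object is Base64-encoded and labelled with its original
    # URL, so references in the HTML can be matched to the embedded copies.
    for url, (mime_type, data) in resources.items():
        maintype, subtype = mime_type.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(data)
        encoders.encode_base64(part)
        part["Content-Location"] = url
        archive.attach(part)

    return archive.as_bytes()
```

The returned bytes could be written to a file with an .mht extension. Because each part keeps its original URL in Content-Location, the HTML does not need to be rewritten.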

Michael Todd