
I run wget to create a warc archive as follows:

$ wget --warc-file=/tmp/epfl --recursive --level=1 http://www.epfl.ch/

$ l -h /tmp/epfl.warc.gz
-rw-r--r--  1 david  wheel   657K Sep  2 15:18 /tmp/epfl.warc.gz

$ find .
./www.epfl.ch/index.html
./www.epfl.ch/public/hp2013/css/homepage.70a623197f74.css
[...]

I only need the epfl.warc.gz file. How do I prevent wget from creating all the individual files?

I tried as follows:

$ wget --warc-file=/tmp/epfl --recursive --level=1 --output-document=/dev/null http://www.epfl.ch/
ERROR: -k or -r can be used together with -O only if outputting to a regular file.
David Portabella

2 Answers


tl;dr Add the options --delete-after and --no-directories.

Option --delete-after instructs wget to delete each downloaded file immediately after its download is complete. As a consequence, the maximum disk usage during execution will be the size of the WARC file plus the size of the single largest downloaded file.

Option --no-directories prevents wget from leaving behind a useless tree of empty directories. By default wget creates a directory tree that mirrors the one on the host, and downloads each file into the appropriate directory of the mirrored tree. wget does this even when the downloaded file is temporary due to --delete-after. To prevent that, use option --no-directories.

The following demonstrates the result, using your example (slightly altered).

$ cd $(mktemp -d)
$ wget --delete-after --no-directories \
  --warc-file=epfl --recursive --level=1 http://www.epfl.ch/
...
Total wall clock time: 12s
Downloaded: 22 files, 1.4M in 5.9s (239 KB/s)
$ ls -lhA
-rw-rw-r--. 1 chadv chadv 1.5M Aug 31 07:55 epfl.warc

If you forget to use --no-directories, you can easily clean up the tree of empty directories with find -type d -delete.
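A minimal sketch of that cleanup, using a hypothetical host name (www.example.com) to stand in for the real directory tree:

```shell
# Recreate the kind of leftover empty tree wget leaves behind
# when --no-directories was forgotten.
cd "$(mktemp -d)"
mkdir -p www.example.com/public/css

# Delete the empty directories bottom-up. -delete implies -depth,
# so children go before parents; -mindepth 1 keeps find from
# trying (and failing) to delete "." itself.
find . -mindepth 1 -type d -delete
```

After this runs, the temporary directory is empty again; any directory still containing real files would be left alone, since `-delete` refuses to remove non-empty directories.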

Chadversary

For individual files (without --recursive), the option -O /dev/null keeps wget from creating an output file. For recursive fetches, /dev/null is not accepted (I don't know why). But why not just write all the output concatenated into a single file via -O tmpfile and delete that file afterwards?
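A sketch of that workaround, using the crawl from the question (/tmp/crawl-output.tmp is a hypothetical scratch path):

```shell
# Write the concatenated page output to a throwaway file; the WARC
# archive is still produced separately by --warc-file.
wget --warc-file=/tmp/epfl --recursive --level=1 \
     --output-document=/tmp/crawl-output.tmp http://www.epfl.ch/

# Discard the duplicate copy once the crawl is done.
rm -f /tmp/crawl-output.tmp
```

Note that the duplicate copy still occupies disk space while the crawl runs, which is the drawback pointed out in the comments below.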

Sebastian Nagel
  • sure. i was only asking if there was a way to avoid duplicating the data. – David Portabella Sep 16 '16 at 14:24
  • A problem with `-O tmpfile` is that the total disk usage will be twice as large as needed (assuming that the tmpfile and WARC file are approximately the same size). See my suggested solution, which avoids this problem. – Chadversary Aug 31 '18 at 15:22