I'm trying to use download.file to fetch some webpages together with their embedded images, etc. With wget I think this is what the -p -k options do, but I can't see how to do the same thing from R.

If I do:

download.file("http://guardian.co.uk","test.html")

that works as expected. But when I run:

download.file("http://guardian.co.uk","test.html", method = "wget", extra = "-p -k") # no recursion (-r), but get page prerequisites (-p) and convert links for local viewing (-k)

I get these warnings:

Warning messages:
1: running command 'wget -p -k "http://guardian.co.uk" -O "test.html"' had status 1 
2: In download.file("http://guardian.co.uk", "test.html", method = "wget",  :
  download had nonzero exit status

I've checked Sys.which("wget") and the path is set (and I'm not trying to access https, which I think can cause issues).

Once I've done this I actually want to put it into a loop where I download a set of urls (& their embedded content) to create a single html output...

sjgknight
  • On my Linux, your command throws `Cannot specify both -k and -O if multiple URLs are given, or in combination with -p or -r. See the manual for details`. i.e. `wget -p -k` is not valid on my Linux. –  Nov 13 '15 at 02:50
  • Ah interesting, so I read advice (e.g. here http://stackoverflow.com/questions/6348289/download-a-working-local-copy-of-a-webpage ) indicating that was how to get a local copy, when I use them `"-k -p"` I get the same error as you, but otherwise not. Removing `-k` means the download works but I don't load the full page (and actually, it looks like the download isn't saving the images). – sjgknight Nov 13 '15 at 03:37
  • Plus in the manual: https://www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html – sjgknight Nov 13 '15 at 03:41
  • "download a set of urls (& their embedded content) to create a single html output": is that just a convenience? It won't be well formed HTML? ......? – psychemedia Nov 13 '15 at 09:06
  • I could strip those @psychemedia (I want to dld a book split across multiple pages, it's CC licensed but not convenient, perhaps to encourage purchase, I don't care if html or pdf but I'd like to do it in R...) – sjgknight Nov 13 '15 at 10:36

1 Answer

Easy solution: just use system() to call wget directly:

system("wget http://guardian.co.uk -p -k")

I think the issue is that passing an output file ('test.html') makes download.file add the -O option, and wget refuses to combine -O with -k when -p (or -r) is also given. Calling wget directly avoids -O, so wget saves the page and its embedded files separately.
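For the loop over several URLs mentioned in the question, something along these lines might work. It is only a sketch: build_wget_args() and mirror_pages() are hypothetical helper names, the "pages" output directory is an assumption, and it uses system2() rather than system() so the arguments can be passed as a vector. The -P flag is wget's directory-prefix option, used here in place of -O so that -p and -k remain valid.

```r
# Build the argument vector for one wget call.
# -p fetches page prerequisites (images, CSS), -k converts links for local
# viewing, -P sets the directory the files are saved under.
# No -O is passed, so -k stays valid.
build_wget_args <- function(url, dest_dir) {
  c("-p", "-k", "-P", shQuote(dest_dir), shQuote(url))
}

# Download each URL in turn and return wget's exit status per URL
# (0 means success). mirror_pages() and dest_dir are hypothetical names,
# not part of any package.
mirror_pages <- function(urls, dest_dir = "pages") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  status <- vapply(
    urls,
    function(u) as.integer(system2("wget", build_wget_args(u, dest_dir))),
    integer(1)
  )
  failed <- names(status)[status != 0L]
  if (length(failed) > 0) {
    warning("wget failed for: ", paste(failed, collapse = ", "))
  }
  status
}

# Example call (needs wget on the PATH and network access):
# mirror_pages(c("http://guardian.co.uk", "http://example.com"))
```

Each page ends up in its own host-named folder under the output directory, so combining them into a single HTML file would still be a separate step.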

sjgknight