Use href and target in download.file R?

Question

I have a snippet of code:

library(assertthat)  # provides assert_that(), used below

raw_prefix <- file.path("data", "raw")

fpa_prefix <- file.path(raw_prefix, "fpa-fod")

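# recursive = TRUE also creates the parent "data/raw" directories if they are missing
if (!dir.exists(fpa_prefix)) {
  dir.create(fpa_prefix, recursive = TRUE)
}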

fpa_gdb <- file.path(fpa_prefix, "RDS-2013-0009.4_GDB", "Data", "FPA_FOD_20170508.gdb")

if (!file.exists(fpa_gdb)) {
  loc <- "https://www.fs.usda.gov/rds/fedora/objects/RDS:RDS-2013-0009.4/datastreams/RDS-2013-0009.4_GDB/content"
  dest <- paste0(fpa_prefix, ".zip")
  download.file(loc, dest)
  unzip(dest, exdir = fpa_prefix)
  unlink(dest)
  assert_that(file.exists(fpa_gdb))
}

This works well with most websites for downloading files on the fly as part of a reproducible workflow, but one dataset I need is linked through an anchor with "href" and "target" attributes, which makes it very difficult to download using download.file().

The file is listed on this page (the direct link is also in the code above):

https://www.fs.usda.gov/rds/archive/Product/RDS-2013-0009.4/

Towards the bottom of the page is a file called

RDS-2013-0009.4_GDB.zip

which is the file I am trying to download using the above script.

If you inspect this element you will find the following structure, and the link returns the correct file in a browser. How do I translate this into R code?

<a href="//www.fs.usda.gov/rds/fedora/objects/RDS:RDS-2013-0009.4/datastreams/RDS-2013-0009.4_GDB/content" target="_blank">RDS-2013-0009.4_GDB.zip</a>

If anyone has an idea on how to download this file I would GREATLY appreciate it!

Thanks!

`target=` just instructs the browser to use a new tab/window/"session" – hrbrmstr, Sep 03 '17 at 23:22
Thanks for the clarification. But if I use just "https://www.fs.usda.gov/rds/fedora/objects/RDS:RDS-2013-0009.4/datastreams/RDS-2013-0009.4_GDB/content", it doesn't work within R. It is fine in a web browser, but not in R... – nate-m, Sep 03 '17 at 23:27

hrbrmstr · Accepted Answer · 2017-09-03T23:43:07.827

3

This will:

- find all the .zip links on the page (URLs and filenames)
- go through each link found and download it "like a browser would do"

Note that write_disk() won't overwrite existing files, so if downloads get interrupted, either delete the file or use overwrite=TRUE.

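library(rvest)  # read_html(), html_nodes(), html_attr(), html_text()
library(httr)   # GET(), write_disk(), progress()
library(purrr)  # walk2()

pg <- read_html("https://www.fs.usda.gov/rds/archive/Product/RDS-2013-0009.4/")

# all links in the product list whose text contains "zip"
fils <- html_nodes(pg, xpath=".//dd[@class='product']//li/a[contains(., 'zip')]")

# .x is each href (protocol-relative, hence the "https:" prefix);
# .y is the link text, used as the local file name
walk2(html_attr(fils, 'href'), html_text(fils),
      ~GET(sprintf("https:%s", .x), write_disk(.y), progress()))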
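
As a sketch of the write_disk() note above, the download can be wrapped in a small helper that leaves an existing file alone unless told otherwise. The helper name fetch_once and its force argument are hypothetical, not part of httr; only GET(), write_disk(), progress() and stop_for_status() are httr functions.

```r
library(httr)

# Hypothetical helper (not part of httr): download `url` to `dest`.
# An existing file is left untouched unless force = TRUE, in which case
# write_disk(overwrite = TRUE) replaces it, e.g. after an interrupted download.
fetch_once <- function(url, dest, force = FALSE) {
  if (file.exists(dest) && !force) {
    message("Skipping existing file: ", dest)
    return(invisible(FALSE))
  }
  resp <- GET(url, write_disk(dest, overwrite = TRUE), progress())
  stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx responses
  invisible(TRUE)
}
```

Calling fetch_once(url, dest, force = TRUE) after a failed run would replace the partial file instead of requiring a manual delete.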

If you don't want to use purrr, the download step can be done with base R:

invisible(
  mapply(
    download.file,
    url = sprintf("https:%s", html_attr(fils, 'href')),
    destfile = html_text(fils),
    MoreArgs = list(mode = "wb")  # binary mode keeps the zip intact on Windows
  )
)
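
The sprintf("https:%s", ...) step in both versions is there because the href on the page is protocol-relative: it starts with // and carries no scheme. A browser fills the scheme in from the current page, while R needs it spelled out. A minimal sketch of that step as a helper follows; the name add_scheme is hypothetical, and defaulting to https is an assumption that happens to match this page.

```r
# Hypothetical helper: prepend a scheme to protocol-relative links ("//host/path").
# Links that already carry a scheme are returned unchanged.
add_scheme <- function(href, scheme = "https") {
  ifelse(startsWith(href, "//"), paste0(scheme, ":", href), href)
}

href <- "//www.fs.usda.gov/rds/fedora/objects/RDS:RDS-2013-0009.4/datastreams/RDS-2013-0009.4_GDB/content"
add_scheme(href)
```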

edited Sep 03 '17 at 23:43

answered Sep 03 '17 at 23:35

hrbrmstr


Hi hrbrmstr, this is great! Thank you for the multiple examples, extremely helpful. I am not opposed to using purrr, but could you guide me through this a little? What are ".x" and ".y"? Where are the files written when using ".y"? Also, in the xpath=".//dd[@class='product']//li/a[contains(., 'zip')]" section, could I specify "gdb" as a unique identifier? – nate-m Sep 04 '17 at 00:00
Never mind, hrbrmstr. I figured out where it writes to when using ".y", but I still cannot figure out what ".x" refers to.... – nate-m Sep 04 '17 at 00:33

`.x` and `.y` are variables in the implicitly defined anonymous function. `.x` is the URL itself and `.y` is the file name. – hrbrmstr Sep 04 '17 at 01:25
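
To make the comment above concrete, here is a small sketch showing that the ~ formula is shorthand for a two-argument function. The URLs and file names are made-up placeholders, and map2_chr() is used instead of walk2() only so the result can be inspected without downloading anything.

```r
library(purrr)

# Placeholder inputs (hypothetical, for illustration only)
hrefs      <- c("//example.org/a.zip", "//example.org/b.zip")
file_names <- c("a.zip", "b.zip")

# Formula shorthand: .x is the element from the first vector, .y from the second
short <- map2_chr(hrefs, file_names, ~ sprintf("https:%s -> %s", .x, .y))

# The same call written with an explicit anonymous function
long <- map2_chr(hrefs, file_names,
                 function(url, file_name) sprintf("https:%s -> %s", url, file_name))

# Both forms should produce the same character vector
identical(short, long)
```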

you can use `xpath=".//dd[@class='product']//li/a[contains(., 'zip') and contains(., 'GDB')]"` for targeting the GDB file – hrbrmstr Sep 04 '17 at 01:26
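
One way to check a filter like this without hitting the site is to run it against a small inline HTML fragment. The markup below is a made-up stand-in for the structure the XPath implies (a dd with class "product" containing list items), not a copy of the real page, and the example.org links are placeholders.

```r
library(rvest)

# Made-up fragment mimicking the structure the XPath expects (an assumption)
html <- '<html><body><dl><dd class="product"><ul>
  <li><a href="//example.org/RDS-0000_GDB/content">RDS-0000_GDB.zip</a></li>
  <li><a href="//example.org/RDS-0000_SQLITE/content">RDS-0000_SQLITE.zip</a></li>
</ul></dd></dl></body></html>'

pg <- read_html(html)

# XPath contains() is case-sensitive, so 'GDB' must match the case of the link text
gdb <- html_nodes(
  pg,
  xpath = ".//dd[@class='product']//li/a[contains(., 'zip') and contains(., 'GDB')]"
)

html_text(gdb)          # link text of the matched node(s)
html_attr(gdb, "href")  # protocol-relative href of the matched node(s)
```

Because contains() is case-sensitive, a lowercase 'gdb' would not match link text written as "GDB".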

Thanks! That is very helpful. Appreciate all your help on this. – nate-m Sep 04 '17 at 14:13
