The completely useless "please give us your email" page means we have to maintain state for any further navigation or downloading. We start by going to the page with the registration form and scraping an "authenticator token" from it to pass along with the next request (ostensibly for security purposes):
library(curlconverter)
library(xml2)
library(httr)
library(rvest)
pg <- read_html("https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration")
html_nodes(pg, "input[name='_authenticator']") %>%
  html_attr("value") -> authenticator
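To see what that selector does without hitting the site, here is a minimal offline sketch. The HTML string is a hypothetical stand-in for the registration form (the real markup may well differ), and the `stopifnot()` guard is just one way to fail early if the token ever goes missing:

```r
library(rvest)  # re-exports read_html() and %>%

# Hypothetical stand-in for the registration page, not the site's real markup
sample_form <- '<html><body><form method="post">
  <input type="hidden" name="_authenticator" value="abc123"/>
  <input type="text" name="first-name"/>
</form></body></html>'

read_html(sample_form) %>%
  html_nodes("input[name='_authenticator']") %>%
  html_attr("value") -> token

# Fail early if the page layout changed and no single token was found
stopifnot(length(token) == 1, !is.na(token), nzchar(token))
token
```

The same guard can be applied to `authenticator` right after scraping the real page.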
I looked at the POST request the form makes using curlconverter (look on SO for how to use it or read that GitLab project site) and came up with:
httr::POST(
  url = "https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration",
  httr::add_headers(
    `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0) Gecko/20100101 Firefox/63.0",
    Referer = "https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration"
  ),
  httr::set_cookies(`restriction-/projects/china/data/datasets/data_downloads` = "/projects/china/data/datasets/data_downloads"),
  body = list(
    `first-name` = "Steve",
    `last-name` = "Rogers",
    `email-address` = "example@me.com",
    `interest` = "a researcher",
    `org` = "The Avengers",
    `department` = "Operations",
    `postal-address` = "1 Avengers Drive",
    `city-name` = "Undisclosed",
    `state-province` = "Virginia",
    `postal-code` = "09911",
    `country-name` = "US",
    `opt-in:boolean:default` = "",
    `fieldset` = "default",
    `form.submitted` = "1",
    `add_reference.field:record` = "",
    `add_reference.type:record` = "",
    `add_reference.destination:record` = "",
    `last_referer` = "https://www.cpc.unc.edu/projects/china/data/datasets",
    `_authenticator` = authenticator,
    `form_submit` = "Submit"
  ),
  encode = "multipart"
) -> res
(curlconverter makes ^^ for you from a simple "copy" of a specific item in Developer Tools)
Hopefully you see where authenticator comes in.
Now that we're registered, we need to get to the files.
First we need to get to the download page:
read_html(httr::content(res, as = "text")) %>%
  html_nodes(xpath=".//p[contains(., 'You may now')]/strong/a") %>%
  html_attr("href") -> dl_pg_link
dl_pg <- httr::GET(url = dl_pg_link)
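As a rough illustration of what that XPath is doing, here is an offline sketch against a made-up confirmation snippet. The markup and the `example.org` URL are assumptions for demonstration only; the real response page may be structured differently:

```r
library(rvest)

# Hypothetical confirmation markup, not copied from the site
sample_confirm <- '<html><body>
  <p>Thanks for registering.</p>
  <p>You may now <strong><a href="https://example.org/downloads">proceed</a></strong>.</p>
</body></html>'

read_html(sample_confirm) %>%
  # find the <p> whose text mentions "You may now", then the link inside <strong>
  html_nodes(xpath = ".//p[contains(., 'You may now')]/strong/a") %>%
  html_attr("href") -> sample_link

# An empty result here would mean the page wording or layout changed
stopifnot(length(sample_link) == 1)
sample_link
```

Because the match depends on the page's wording, checking `length(dl_pg_link) == 1` before the `GET` is a cheap way to catch a changed page.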
Then we need to get to the real download page:
httr::content(dl_pg, as = "text") %>%
  read_html() %>%
  html_nodes(xpath=".//a[contains(@class, 'contenttype-folder state-published url')]") %>%
  html_attr("href") -> dls
Then we need to get all the downloadable bits from that page:
zip_pg <- httr::GET(url = dls)
httr::content(zip_pg, as = "text") %>%
  read_html() %>%
  html_nodes("td > a") %>%
  html_attr("href") %>%
  gsub("view$", "at_download/file", .) -> dl_links
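The `gsub()` at the end swaps the trailing `view` of each listing link for the direct-download path. A small offline sketch of that rewrite follows; the two `/view` hrefs are assumed inputs, reconstructed from the download URLs in this answer rather than copied from the live page:

```r
# Assumed listing-page hrefs (reconstructed for illustration)
hrefs <- c(
  "https://www.cpc.unc.edu/projects/china/data/datasets/data_downloads/longitudinal/weights-chns.pdf/view",
  "https://www.cpc.unc.edu/projects/china/data/datasets/data_downloads/longitudinal/Biomarker_2012Dec.zip/view"
)

# "view$" only matches at the very end of the string, so the rest of the URL is untouched
sample_dl <- gsub("view$", "at_download/file", hrefs)

# The file name sits two path components above the trailing "at_download/file"
sample_names <- basename(dirname(dirname(sample_dl)))
sample_names
```

Any link that does not end in `view` passes through unchanged, so it is worth eyeballing `dl_links` before downloading.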
Here's how to get the first one:
(fil1 <- httr::GET(dl_links[1]))
## Response [https://www.cpc.unc.edu/projects/china/data/datasets/data_downloads/longitudinal/weights-chns.pdf/at_download/file]
## Date: 2018-10-14 03:03
## Status: 200
## Content-Type: application/pdf
## Size: 197 kB
## <BINARY BODY>
fil1$headers[["content-disposition"]]
## [1] "attachment; filename=\"weights-chns.pdf\""
writeBin(
  httr::content(fil1, as = "raw"),
  file.path("~/Data", gsub('"', '', strsplit(fil1$headers[["content-disposition"]], "=")[[1]][2]))
)
That first file is a PDF. Here's how to get the second one, which is a ZIP:
(fil2 <- httr::GET(dl_links[2]))
## Response [https://www.cpc.unc.edu/projects/china/data/datasets/data_downloads/longitudinal/Biomarker_2012Dec.zip/at_download/file]
## Date: 2018-10-14 03:06
## Status: 200
## Content-Type: application/zip
## Size: 2.37 MB
## <BINARY BODY>
fil2$headers[["content-disposition"]]
## [1] "attachment; filename=\"Biomarker_2012Dec.zip\""
writeBin(
  httr::content(fil2, as = "raw"),
  file.path("~/Data", gsub('"', '', strsplit(fil2$headers[["content-disposition"]], "=")[[1]][2]))
)
You can turn ^^ into an iterative operation.
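One possible way to do that is sketched below. Both `disposition_filename()` and `download_all()` are hypothetical helpers written for this illustration (they are not part of httr), and the parser only handles the simple `filename="..."` form of the header seen above:

```r
library(httr)

# Pull the file name out of a Content-Disposition header value.
# Only handles the simple (optionally quoted) filename=... form.
disposition_filename <- function(x) {
  if (is.null(x)) return(NA_character_)
  m <- regmatches(x, regexec('filename="?([^";]+)"?', x))[[1]]
  if (length(m) < 2) NA_character_ else basename(m[2])
}

# Hypothetical helper: fetch every link and save it under `dir`
download_all <- function(links, dir = "~/Data") {
  dir <- path.expand(dir)
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  vapply(links, function(u) {
    r <- GET(u)
    stop_for_status(r)  # stop on 4xx/5xx instead of saving an error page
    fn <- disposition_filename(r$headers[["content-disposition"]])
    # Fall back to the URL component above "at_download/file"
    if (is.na(fn)) fn <- basename(dirname(dirname(u)))
    out <- file.path(dir, fn)
    writeBin(content(r, as = "raw"), out)
    out
  }, character(1), USE.NAMES = FALSE)
}
```

With the session from the steps above still live, `saved <- download_all(dl_links)` should return the paths it wrote; adding a `Sys.sleep()` between requests would be kinder to the server.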
Note that you must start from the top of this (i.e. start at the "enter your email" form page) every time you start a new R session, since the underlying curl package (which powers httr and rvest) maintains session state for you (in cookies) only for the life of that session.