
This question has been asked before, but I haven't found a solution yet. I want to scrape a number of zipped .dat files from a website. So far, I am at this point:

library(XML)
url <- c("http://blablabla")
zipped <- htmlParse(url)
nodes_a <- getNodeSet(zipped, "//a")
files <- grep("*.zip", sapply(nodes_a, function(nodes_a) xmlGetAttr(nodes_a, "href")), value = TRUE)
urls <- paste(url, files, sep = "")

Then I use this:

mapply(function(x,y) download.file(x,y),urls,files)

and this is the Error message I get:

Error in mapply(function(x, y) download.file(x, y), urls, files) : 
 zero-length inputs cannot be mixed with those of non-zero length

Any hint?

Helena
  • Can you provide us with a reproducible link that contains a zip? Also, have you tried `mapply(download.file, urls, files)`? – Shique May 30 '18 at 07:24
  • Hi. The page is the following: http://www.cpc.unc.edu/projects/china/data/datasets/data_downloads/longitudinal Good idea, I will try with mapply, in the meantime. – Helena May 30 '18 at 12:13
  • That site is password protected, so parsing might not even be possible (is `nodes_a` empty?). That means you would have to scrape the data by simulating a session where you have access to the XML, using either `RSelenium`, `rvest` or `httr`. Check [this](https://stackoverflow.com/questions/24723606/scrape-password-protected-website-in-r?noredirect=1&lq=1). Or, if you only need to scrape it once, you can simply copy the XML contents, save them to a file and parse that. – Shique May 30 '18 at 12:27
  • The site is password protected, but it just wants you to provide a valid email address, no double-check or anything. In any case, `nodes_a` is not empty; I checked that immediately. I will try the way you suggest, though. Thanks – Helena May 30 '18 at 13:23

1 Answer


The completely useless "please give us your email" page means we have to maintain session state for any further navigation or downloading. So we start by going to the page with the registration form and scraping an "authenticator token" out of it, which has to be passed along with the next request (ostensibly for security purposes):

library(curlconverter)
library(xml2)
library(httr)
library(rvest)

pg <- read_html("https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration")

html_nodes(pg, "input[name='_authenticator']") %>% 
  html_attr("value") -> authenticator

I looked at the POST request the form makes using curlconverter (look on SO for how to use it, or read the GitLab project site) and came up with:

httr::POST(
  url = "https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration",
  httr::add_headers(
    `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0) Gecko/20100101 Firefox/63.0",
    Referer = "https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration"
  ),
  httr::set_cookies(`restriction-/projects/china/data/datasets/data_downloads` = "/projects/china/data/datasets/data_downloads"),
  body = list(
    `first-name` = "Steve",
    `last-name` = "Rogers",
    `email-address` = "example@me.com",
    `interest` = "a researcher",
    `org` = "The Avengers",
    `department` = "Operations",
    `postal-address` = "1 Avengers Drive",
    `city-name` = "Undisclosed",
    `state-province` = "Virginia",
    `postal-code` = "09911",
    `country-name` = "US",
    `opt-in:boolean:default` = "",
    `fieldset` = "default",
    `form.submitted` = "1",
    `add_reference.field:record` = "",
    `add_reference.type:record` = "",
    `add_reference.destination:record` = "",
    `last_referer` = "https://www.cpc.unc.edu/projects/china/data/datasets",
    `_authenticator` = authenticator,
    `form_submit` = "Submit"
  ), 
  encode = "multipart"
) -> res

(curlconverter builds ^^ for you from a simple "Copy as cURL" of the request in Developer Tools.)

Hopefully you see where authenticator comes in.
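
A quick sanity check that isn't part of the original flow, but is worth doing before moving on: if the form submission was accepted, the response status should be 200.

# should print 200 and FALSE if the registration form went through
httr::status_code(res)
httr::http_error(res)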

Now that we've got that out of the way, we need to get to the files.

First we need to get to the download page:

read_html(httr::content(res, as = "text")) %>% 
  html_nodes(xpath=".//p[contains(., 'You may now')]/strong/a") %>% 
  html_attr("href") -> dl_pg_link

dl_pg <- httr::GET(url = dl_pg_link)

Then we need to get to the real download page:

httr::content(dl_pg, as = "text") %>% 
  read_html() %>% 
  html_nodes(xpath=".//a[contains(@class, 'contenttype-folder state-published url')]") %>% 
  html_attr("href") -> dls

Then we need to get all the downloadable bits from that page:

zip_pg <- httr::GET(url = dls)

httr::content(zip_pg, as = "text") %>% 
  read_html() %>% 
  html_nodes("td > a") %>% 
  html_attr("href") %>% 
  gsub("view$", "at_download/file", .) -> dl_links

Here's how to get the first one, which is a PDF:

(fil1 <- httr::GET(dl_links[1]))
## Response [https://www.cpc.unc.edu/projects/china/data/datasets/data_downloads/longitudinal/weights-chns.pdf/at_download/file]
##   Date: 2018-10-14 03:03
##   Status: 200
##   Content-Type: application/pdf
##   Size: 197 kB
## <BINARY BODY>

fil1$headers[["content-disposition"]]
## [1] "attachment; filename=\"weights-chns.pdf\""

writeBin(
  httr::content(fil1, as = "raw"),
  file.path("~/Data", gsub('"', '', strsplit(fil1$headers[["content-disposition"]], "=")[[1]][2]))
)

And here's how to get the second one, which is a ZIP:

(fil2 <- httr::GET(dl_links[2]))
## Response [https://www.cpc.unc.edu/projects/china/data/datasets/data_downloads/longitudinal/Biomarker_2012Dec.zip/at_download/file]
##   Date: 2018-10-14 03:06
##   Status: 200
##   Content-Type: application/zip
##   Size: 2.37 MB
## <BINARY BODY>

fil2$headers[["content-disposition"]]
## [1] "attachment; filename=\"Biomarker_2012Dec.zip\""

writeBin(
  httr::content(fil2, as = "raw"),
  file.path("~/Data", gsub('"', '', strsplit(fil2$headers[["content-disposition"]], "=")[[1]][2]))
)

You can turn ^^ into an iterative operation.
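
For example, a minimal loop over dl_links might look like the sketch below (it assumes every link returns a content-disposition header and that the local ~/Data folder used above exists):

for (dl_link in dl_links) {
  fil <- httr::GET(dl_link)
  # pull the server-supplied filename out of the content-disposition header
  fname <- gsub('"', '', strsplit(fil$headers[["content-disposition"]], "=")[[1]][2])
  writeBin(httr::content(fil, as = "raw"), file.path("~/Data", fname))
  Sys.sleep(5) # be polite to the server between requests
}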

Note that you must start from the top of this (i.e. start at the enter-your-email form page) every time you start a new R session, since the underlying curl package (which powers httr and rvest) maintains session state for you (in cookies).
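
If you want to confirm that the session cookie is actually being carried along, httr can show you what curl has stored (purely a diagnostic; not needed for the downloads):

# list the cookies curl is holding for this session
httr::cookies(res)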

hrbrmstr