I'm working on scripting some dataset downloads in R from the Center for Survey and Survey/Registrar Data, a Nesstar-based data archive: http://cssr.surveybank.aau.dk/webview

Poking around, I've found there are bookmarkable links for each dataset in each format, e.g., http://cssr.surveybank.aau.dk/webview/velocity?format=STATA&includeDocumentation=on&execute=&ddiformat=pdf&study=http%3A%2F%2F172.18.36.233%3A80%2Fobj%2FfStudy%2FElectionStudy-1973&analysismode=table&v=2&mode=download

There's no username or password required to use the site, so that's one bullet dodged. But the next step is to click on the "Download" button, and that's where I'm stumped. The question "Using R to 'click' a download file button on a webpage" sounds like it should be right on, but the webpage there isn't actually similar to this one. Unlike that one, this button is not part of a form, so my efforts using html_form() and submit_form() predictably got nowhere. (And it's not a link, so of course follow_link() won't work either.) The following gets me to the right node, but doesn't actually click the button.

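library(magrittr)  # provides the %>% pipe
library(rvest)

# Bookmarkable download page for one study in Stata format
url <- "http://cssr.surveybank.aau.dk/webview/velocity?format=STATA&includeDocumentation=on&execute=&ddiformat=pdf&study=http%3A%2F%2F172.18.36.233%3A80%2Fobj%2FfStudy%2FElectionStudy-1973&analysismode=table&v=2&mode=download"

# Open a browsing session (html_session() was renamed session() in rvest 1.0.0)
s <- html_session(url)

# Select the first node with class "button"; this finds the node but does not click it
download_button <- s %>% html_node(".button")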

Now that RSelenium is back on CRAN (yay!), I suppose I could go in that direction instead, but I'd really prefer an rvest or httr-based solution. If anyone could help, I'd really appreciate it.

Frederick Solt
  • Perhaps you should also consider the more modern option: the `decapitated` package. I know it sounds like a Halloween joke, but it refers to the use of a "headless" browser. There are some examples of its use on SO. – IRTFM Oct 14 '18 at 00:43
  • Or, perhaps, choose not to violate the controls in the [robots.txt](http://cssr.surveybank.aau.dk/robots.txt) of the site. Most of those links hit URLs with those restricted prefixes and violating that control says "my needs are more important than the site's". – hrbrmstr Oct 14 '18 at 01:28
  • Excellent point; I’d neglected to check. I should know better. Thanks for all your help on this site! – Frederick Solt Oct 14 '18 at 11:22
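
The robots.txt check raised in the comments can be sketched in base R before any download is scripted. This is only a minimal illustration under stated assumptions: the rules in `robots_txt` are hypothetical placeholders (not the contents of the site's actual file), the helper names `disallowed_prefixes()`, `url_path_query()`, and `is_disallowed()` are made up for this example, and the matching is simplified to plain prefix comparison, ignoring User-agent groups, Allow rules, and wildcards.

```r
# Hypothetical robots.txt lines, for illustration only. In practice, read the
# real file, e.g. readLines("http://cssr.surveybank.aau.dk/robots.txt").
robots_txt <- c(
  "User-agent: *",
  "Disallow: /example-private/",
  "Disallow:"
)

# Collect the non-empty Disallow prefixes (simplified: every Disallow line is
# treated as applying to every user agent).
disallowed_prefixes <- function(lines) {
  rules <- grep("^[[:space:]]*Disallow[[:space:]]*:", lines,
                ignore.case = TRUE, value = TRUE)
  prefixes <- trimws(sub("^[^:]*:", "", rules))
  prefixes[nzchar(prefixes)]
}

# Drop the scheme and host from a URL, keeping path and query string, since
# robots.txt rules are matched against both. The fragment is discarded.
url_path_query <- function(url) {
  target <- sub("^[a-zA-Z][a-zA-Z0-9+.-]*://[^/?#]*", "", url)
  target <- sub("#.*$", "", target)
  if (nzchar(target)) target else "/"
}

# TRUE if the URL's path starts with any disallowed prefix.
is_disallowed <- function(url, lines) {
  any(startsWith(url_path_query(url), disallowed_prefixes(lines)))
}

# Usage with made-up URLs on a placeholder host:
is_disallowed("http://example.org/example-private/data?x=1", robots_txt)
is_disallowed("http://example.org/public/index.html", robots_txt)
```

For real use, the robotstxt package on CRAN handles the parts this sketch skips; the point here is only that the check is cheap enough to run before each scripted request.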