
I am trying to fix a download automation script that I provide publicly so that anyone can easily download the World Values Survey with R.

On this web page - http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp - the PDF link "WVS_2000_Questionnaire_Root" downloads easily in Firefox and Chrome, but I cannot figure out how to automate the download with httr or RCurl or any other R package. A screenshot of the Chrome network behavior is below. That PDF link needs to follow through to the ultimate source at http://www.worldvaluessurvey.org/wvsdc/DC00012/F00001316-WVS_2000_Questionnaire_Root.pdf, but if you click there directly, there's a connectivity error. I am unclear whether this is related to the request header Upgrade-Insecure-Requests: 1 or the response status code 302.
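For illustration, this is roughly the kind of within-R request that fails for me (a sketch only; I have tried a number of variations with httr and RCurl):

library(httr)

# the ultimate PDF URL that the link resolves to in a browser
pdf_url <- "http://www.worldvaluessurvey.org/wvsdc/DC00012/F00001316-WVS_2000_Questionnaire_Root.pdf"

# a plain GET does not come back with the PDF the way a browser click does
resp <- GET(pdf_url)
status_code(resp)                 # not a usable PDF response
headers(resp)[["content-type"]]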

Clicking around the new worldvaluessurvey.org website with Chrome's Inspect Element window open makes me think some hacky coding decisions were made here, hence the title "semi-broken" :/

[screenshot: Chrome's network inspector showing the request and response headers for the PDF link]

Anthony Damico
    wow, quite impressive to sacrifice almost all your reputation to a single question! ;-) – agenis Sep 23 '16 at 08:40
    Heck; I'd be happy to support getting a useful answer to this question. If you don't get a solution and award it in the required time, let me know and I'll put up another 500 rep to make sure that it stays featured. Thank you for all your work at making the public datasets accessible, Anthony. – IRTFM Sep 29 '16 at 03:27
  • @42- thanks a lot david, i appreciate it. the headless browsing answer is a good one, but the poster is right that it'd be better within-R only. i'm worried that someone will give a good `RCurl` answer and then the world values survey people will change the website again.. occupational hazard ;) – Anthony Damico Sep 29 '16 at 06:09

4 Answers


I've had to deal with this sort of thing in the past. My solution has been to use a headless browser to programmatically navigate and manipulate the web pages that contained the resources I was interested in. I've even done fairly non-straightforward tasks like logging in and filling out and submitting forms using this method.

I can see that you're trying to use a pure R approach to download these files by reverse-engineering the GET/POST requests that are generated by the link. This could work, but it would leave your implementation highly vulnerable to any future changes in the site's design, such as changes in the JavaScript event handler, URL redirects, or header requirements.

By using a headless browser you can limit your exposure to the top-level URL and a few minimal XPath queries that allow navigation to the target link. Granted, this still ties your code to non-contractual and fairly internal details of the site's design, but it's certainly less of an exposure. This is the hazard of web scraping.


I've always used the Java HtmlUnit library for my headless browsing, which I've found to be quite excellent. Of course, to leverage a Java-based solution from Rland would require spawning a Java process, which would require (1) Java to be installed on the user's machine, (2) the $CLASSPATH to be properly set up to locate the HtmlUnit JARs as well as your custom file-downloading main class, and (3) proper invocation of the Java command with correct arguments using one of R's methods of shelling out to a system command. Needless to say, this is fairly involved and messy.

A pure R headless browsing solution would be nice, but unfortunately, it looks to me like R does not offer any native headless browsing solution. The closest is RSelenium, which appears to be just an R binding to the Java client library of the Selenium browser automation software. This means it will not operate independently of the user's GUI browser, and requires interaction with an external Java process anyway (although in this case the details of the interaction are conveniently encapsulated underneath the RSelenium API).


Using HtmlUnit, I've created a fairly generic Java main class that can be used to download a file by clicking on a link on a web page. The parameterization of the application is as follows:

  • The URL of the page.
  • An optional sequence of XPath expressions to allow descending into any number of nested frames starting from the top-level page. Note: I actually parse this out of the URL argument by splitting on \s*>\s*, which I like as a concise syntax. I used the > character because it is not valid in URLs.
  • A single XPath expression that specifies the anchor link to click.
  • An optional file name under which to save the downloaded file. If omitted, it will be derived from either a Content-Disposition header whose value matches the pattern filename="(.*)" (this was an unusual case I encountered when scraping icons a while back) or, failing that, the basename of the request URL that triggered the file stream response. The basename derivation method works for your target link.

Here's the code:

package com.bgoldst;

import java.util.List;
import java.util.ArrayList;

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.IOException;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.ConfirmHandler;
import com.gargoylesoftware.htmlunit.WebWindowListener;
import com.gargoylesoftware.htmlunit.WebWindowEvent;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.util.NameValuePair;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.BaseFrameElement;

public class DownloadFileByXPath {

    public static ConfirmHandler s_downloadConfirmHandler = null;
    public static WebWindowListener s_downloadWebWindowListener = null;
    public static String s_saveFile = null;

    public static void main(String[] args) throws Exception {

        if (args.length < 2 || args.length > 3) {
            System.err.println("usage: {url}[>{framexpath}*] {anchorxpath} [{filename}]");
            System.exit(1);
        } // end if
        String url = args[0];
        String anchorXPath = args[1];
        s_saveFile = args.length >= 3 ? args[2] : null;

        // parse the url argument into the actual URL and optional subsequent frame xpaths
        String[] fields = Pattern.compile("\\s*>\\s*").split(url);
        List<String> frameXPaths = new ArrayList<String>();
        if (fields.length > 1) {
            url = fields[0];
            for (int i = 1; i < fields.length; ++i)
                frameXPaths.add(fields[i]);
        } // end if

        // prepare web client to handle download dialog and stream event
        s_downloadConfirmHandler = new ConfirmHandler() {
            public boolean handleConfirm(Page page, String message) {
                return true;
            }
        };
        s_downloadWebWindowListener = new WebWindowListener() {
            public void webWindowContentChanged(WebWindowEvent event) {

                WebResponse response = event.getWebWindow().getEnclosedPage().getWebResponse();

                //System.out.println(response.getLoadTime());
                //System.out.println(response.getStatusCode());
                //System.out.println(response.getContentType());

                // filter for content type
                // will apply simple rejection of spurious text/html responses; could enhance this with command-line option to whitelist
                String contentType = response.getResponseHeaderValue("Content-Type");
                if (contentType.contains("text/html")) return;

                // determine file name to use; derive dynamically from request or response headers if not specified by user
                // 1: user
                String saveFile = s_saveFile;
                // 2: response Content-Disposition
                if (saveFile == null) {
                    Pattern p = Pattern.compile("filename=\"(.*)\"");
                    Matcher m;
                    List<NameValuePair> headers = response.getResponseHeaders();
                    for (NameValuePair header : headers) {
                        String name = header.getName();
                        String value = header.getValue();
                        //System.out.println(name+" : "+value);
                        if (name.equals("Content-Disposition")) {
                            m = p.matcher(value);
                            if (m.find())
                                saveFile = m.group(1);
                        } // end if
                    } // end for
                    if (saveFile != null) saveFile = sanitizeForFileName(saveFile);
                    // 3: request URL
                    if (saveFile == null) {
                        WebRequest request = response.getWebRequest();
                        File requestFile = new File(request.getUrl().getPath());
                        saveFile = requestFile.getName(); // just basename
                    } // end if
                } // end if

                getFileResponse(response,saveFile);

            } // end webWindowContentChanged()
            public void webWindowOpened(WebWindowEvent event) {}
            public void webWindowClosed(WebWindowEvent event) {}
        };

        // initialize browser
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(true); // required for JavaScript-powered links
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

        // 1: get home page
        HtmlPage page;
        try { page = webClient.getPage(url); } catch (IOException e) { throw new Exception("error: could not get URL \""+url+"\".",e); }
        //page.getEnclosingWindow().setName("main window");

        // 2: navigate through frames as specified by the user
        for (int i = 0; i < frameXPaths.size(); ++i) {
            String frameXPath = frameXPaths.get(i);
            List<?> elemList = page.getByXPath(frameXPath);
            if (elemList.size() != 1) throw new Exception("error: frame "+(i+1)+" xpath \""+frameXPath+"\" returned "+elemList.size()+" elements on page \""+page.getTitleText()+"\" >>>\n"+page.asXml()+"\n<<<.");
            if (!(elemList.get(0) instanceof BaseFrameElement)) throw new Exception("error: frame "+(i+1)+" xpath \""+frameXPath+"\" returned a non-frame element on page \""+page.getTitleText()+"\" >>>\n"+page.asXml()+"\n<<<.");
            BaseFrameElement frame = (BaseFrameElement)elemList.get(0);
            Page enclosedPage = frame.getEnclosedPage();
            if (!(enclosedPage instanceof HtmlPage)) throw new Exception("error: frame "+(i+1)+" encloses a non-HTML page.");
            page = (HtmlPage)enclosedPage;
        } // end for

        // 3: get the target anchor element by xpath
        List<?> elemList = page.getByXPath(anchorXPath);
        if (elemList.size() != 1) throw new Exception("error: anchor xpath \""+anchorXPath+"\" returned "+elemList.size()+" elements on page \""+page.getTitleText()+"\" >>>\n"+page.asXml()+"\n<<<.");
        if (!(elemList.get(0) instanceof HtmlAnchor)) throw new Exception("error: anchor xpath \""+anchorXPath+"\" returned a non-anchor element on page \""+page.getTitleText()+"\" >>>\n"+page.asXml()+"\n<<<.");
        HtmlAnchor anchor = (HtmlAnchor)elemList.get(0);

        // 4: click the target anchor with the appropriate confirmation dialog handler and content handler
        webClient.setConfirmHandler(s_downloadConfirmHandler);
        webClient.addWebWindowListener(s_downloadWebWindowListener);
        anchor.click();
        webClient.setConfirmHandler(null);
        webClient.removeWebWindowListener(s_downloadWebWindowListener);

        System.exit(0);

    } // end main()

    public static void getFileResponse(WebResponse response, String fileName ) {

        InputStream inputStream = null;
        OutputStream outputStream = null;

        // write the inputStream to a FileOutputStream
        try {

            System.out.print("streaming file to disk...");

            inputStream = response.getContentAsStream();

            // write the inputStream to a FileOutputStream
            outputStream = new FileOutputStream(new File(fileName));

            int read = 0;
            byte[] bytes = new byte[1024];

            while ((read = inputStream.read(bytes)) != -1)
                outputStream.write(bytes, 0, read);

            System.out.println("done");

        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (inputStream != null) {
                try {
                    inputStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                } // end try-catch
            } // end if
            if (outputStream != null) {
                try {
                    //outputStream.flush();
                    outputStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                } // end try-catch
            } // end if
        } // end try-catch

    } // end getFileResponse()

    public static String sanitizeForFileName(String unsanitizedStr) {
        return unsanitizedStr.replaceAll("[^\040-\176]","_").replaceAll("[/\\<>|:*?]","_");
    } // end sanitizeForFileName()

} // end class DownloadFileByXPath

Below is a demo of me running the main class on my system. I've snipped out most of HtmlUnit's verbose output. I'll explain the command-line arguments afterward.

ls;
## bin/  src/
CLASSPATH="bin;C:/cygwin/usr/local/share/htmlunit-latest/*" java com.bgoldst.DownloadFileByXPath "http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp > //iframe[@id='frame1'] > //iframe[@id='frameDoc']" "//a[contains(text(),'WVS_2000_Questionnaire_Root')]";
## Jul 10, 2016 1:34:34 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
## WARNING: Obsolete content type encountered: 'application/x-javascript'.
## Jul 10, 2016 1:34:34 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
## WARNING: Obsolete content type encountered: 'application/x-javascript'.
##
## ... snip ...
##
## Jul 10, 2016 1:34:45 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
## WARNING: Obsolete content type encountered: 'text/javascript'.
## streaming file to disk...done
## 
ls;
## bin/  F00001316-WVS_2000_Questionnaire_Root.pdf*  src/
  • CLASSPATH="bin;C:/cygwin/usr/local/share/htmlunit-latest/*" Here I set the $CLASSPATH for my system using a variable-assignment prefix (note: I was running in the Cygwin bash shell). The .class file I compiled into bin, and I've installed the HtmlUnit JARs into my Cygwin system directory structure, which is probably slightly unusual.
  • java com.bgoldst.DownloadFileByXPath Obviously this is the command word and the name of the main class to execute.
  • "http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp > //iframe[@id='frame1'] > //iframe[@id='frameDoc']" This is the URL and frame XPath expressions. Your target link is nested under two iframes, thus requiring the two XPath expressions. You can find the id attributes in the source, either by viewing the raw HTML or using a web development tool (Firebug is my favorite).
  • "//a[contains(text(),'WVS_2000_Questionnaire_Root')]" Finally, this is the actual XPath expression for the target link within the inner iframe.

I omitted the file name argument. As you can see, the code correctly derived the name of the file from the request URL.
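For completeness, kicking off that same invocation from within R could look roughly like this (a sketch only; the classpath is a placeholder for wherever the HtmlUnit JARs and the compiled class actually live, and quoting details differ by platform and shell):

# rough sketch of shelling out to the HtmlUnit downloader from R;
# the classpath below is a placeholder, not a real install location
cp <- "bin;C:/path/to/htmlunit/*"    # use ":" as the path separator on Linux/macOS

args <- c(
  "-cp", cp,
  "com.bgoldst.DownloadFileByXPath",
  "http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp > //iframe[@id='frame1'] > //iframe[@id='frameDoc']",
  "//a[contains(text(),'WVS_2000_Questionnaire_Root')]"
)

status <- system2("java", shQuote(args))
if (status != 0L) stop("headless download failed")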


I recognize that this is a lot of trouble to go through to download a file, but for web scraping in general, I really think the only robust and viable approach is to go the whole nine yards and use a full headless browser engine. It may be best to entirely separate the task of downloading these files from Rland, and instead implement the entire scraping system using a Java application, maybe supplemented with some shell scripts for a more flexible front end. Unless you're working with download URLs that were designed for no-frills one-shot HTTP requests by clients like curl, wget, and R, using R for web scraping is probably not a good idea. That's my two cents.

bgoldst

Using the excellent curlconverter to mimic the browser, you can request the PDF directly.

First we mimic the browser's initial GET request (this may not be strictly necessary; a simple GET that keeps the cookie may suffice):

library(curlconverter)
library(httr)
browserGET <- "curl 'http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp' -H 'Host: www.worldvaluessurvey.org' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1'"
getDATA <- (straighten(browserGET) %>% make_req)[[1]]()

The JSESSIONID cookie is available at getDATA$cookies$value

getPDF <- "curl 'http://www.worldvaluessurvey.org/wvsdc/DC00012/F00001316-WVS_2000_Questionnaire_Root.pdf' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: en-US,en;q=0.5' -H 'Connection: keep-alive' -H 'Cookie: JSESSIONID=59558DE631D107B61F528C952FC6E21F' -H 'Host: www.worldvaluessurvey.org' -H 'Referer: http://www.worldvaluessurvey.org/AJDocumentationSmpl.jsp' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0'"
appIP <- straighten(getPDF)
# replace cookie
appIP[[1]]$cookies$JSESSIONID <- getDATA$cookies$value
appReq <- make_req(appIP)
response <- appReq[[1]]()
writeBin(response$content, "test.pdf")

The curl strings were plucked straight from the browser and curlconverter then does all the work.
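If you would rather not hard-code the cookie at all, a rough httr-only variation on the same idea (an untested sketch; it assumes the server will honour a JSESSIONID issued to httr, which the other answers suggest it does not always do) would be:

library(httr)

# visit the documentation page first; httr reuses one handle per host,
# so any JSESSIONID set here is sent automatically with the next request
GET("http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp")

resp <- GET(
  "http://www.worldvaluessurvey.org/wvsdc/DC00012/F00001316-WVS_2000_Questionnaire_Root.pdf",
  add_headers(Referer = "http://www.worldvaluessurvey.org/AJDocumentationSmpl.jsp"),
  write_disk("WVS_2000_Questionnaire_Root.pdf", overwrite = TRUE)
)
stop_for_status(resp)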

jdharrison
  • @AnthonyDamico: We need to replenish your rep or protect your new rep. If you want me to sponsor questions, let me know. I think you have my email, since I changed my SO-name from DWin. – IRTFM Sep 29 '16 at 22:25
  • thanks a lot! if my next question comes up before my reputation builds back, i'll send you a note :D – Anthony Damico Sep 30 '16 at 01:13

Looking at the code for the DocDownload function, it essentially just does a POST to /AJDownload.jsp with the form parameters ulthost: WVS, CndWAVE: 4, SAID: 0, DOID: (the doc id), and AJArchive: WVS Data Archive. I'm not sure whether all of those are required, but it's probably best to include them anyway.

Doing that in R using httr would look something like this:

r <- POST("http://www.worldvaluessurvey.org/AJDownload.jsp", body = list("ulthost" = "WVS", "CndWAVE" = 4, "SAID" = 0, "DOID" = 1316, "AJArchive" = "WVS Data Archive"))

The AJDownload.jsp endpoint returns a 302 (a redirect to the real URL), and the httr library should automatically follow the redirect for you. Through trial and error, I determined that the server requires both the Content-Type and Cookie headers; otherwise it returns an empty 400 (Bad Request) response. You will need a valid cookie, which you can find by inspecting any page load to that server and looking for the header Cookie: JSESSIONID=..... ; copy the whole header value.

With those headers in place, it looks like:

r <- POST("http://www.worldvaluessurvey.org/AJDownload.jsp", body = list("ulthost" = "WVS", "CndWAVE" = 4, "SAID" = 0, "DOID" = 1316, "AJArchive" = "WVS Data Archive"), add_headers("Content-Type" = "application/x-www-form-urlencoded", "Cookie" = "[PASTE COOKIE VALUE HERE]"))

The response is binary PDF data, so you will need to save it to a file to do anything with it:

bin <- content(r, "raw")
writeBin(bin, "myfile.pdf")

EDIT:

Okay, I got some time to actually run the code. I also found the minimum required parameters for the POST call, which are just the DOID, the JSESSIONID cookie, and the Referer header.

library(httr)
download_url <- "http://www.worldvaluessurvey.org/AJDownload.jsp"
frame_url <- "http://www.worldvaluessurvey.org/AJDocumentationSmpl.jsp"
body <- list("DOID" = "1316")

file_r <- POST(download_url, body = body, encode = "form",
          set_cookies("JSESSIONID" = "0E657C37FF030B41C33B7D2B1DCAB3D8"),
          add_headers("Referer" = frame_url),
          verbose())

This worked on my machine and correctly returns the PDF binary data.

This is what happens if I set the cookie manually from my web browser. I am only using the JSESSIONID part of the cookie and nothing else. As I mentioned before, the JSESSIONID will expire, likely from age or inactivity. [screenshot: successful run returning the PDF binary]
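To automate the cookie step end to end, the idea would be to request an ordinary page first and pull the JSESSIONID out of that response (a sketch only; as the comment thread below notes, the server has been rejecting cookies it issues to R sessions, so this may still fail for you):

library(httr)

# sketch: obtain a JSESSIONID from a normal page load, then reuse it for the POST
landing  <- GET("http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp")
ck       <- cookies(landing)
jsession <- ck$value[ck$name == "JSESSIONID"]

file_r <- POST("http://www.worldvaluessurvey.org/AJDownload.jsp",
               body = list("DOID" = "1316"), encode = "form",
               set_cookies("JSESSIONID" = jsession),
               add_headers("Referer" = "http://www.worldvaluessurvey.org/AJDocumentationSmpl.jsp"),
               verbose())

writeBin(content(file_r, "raw"), "F00001316-WVS_2000_Questionnaire_Root.pdf")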

Ben Abraham
  • thanks but i tried your example a few different ways and none of them worked? `library(httr) ; r <- POST("http://www.worldvaluessurvey.org/AJDownload.jsp", body = list("ulthost" = "WVS", "CndWAVE" = 4, "SAID" = 0, "DOID" = 1316, "AJArchive" = "WVS Data Archive")) ; x <- POST("http://www.worldvaluessurvey.org/AJDownload.jsp", body = list("ulthost" = "WVS", "CndWAVE" = 4, "SAID" = 0, "DOID" = 1316, "AJArchive" = "WVS Data Archive"),add_headers("Content-Type" = "application/x-www-form-urlencoded", "JSESSIONID" = cookies(r)$value)) ; x$content` – Anthony Damico Jul 07 '16 at 16:46
  • I updated with code that I ran locally and worked. Give the changes a try. Also something to note, the JSESSIONID cookie can and will expire. So you will want to make sure you are using a recently used one. To further automate in the future you can request a page (like the listing, or possibly even the homepage) first, and you can extract the cookie from the response of that. – Ben Abraham Jul 07 '16 at 18:55
  • i'm sorry, your example is still not working for me? could you revise your answer so it automates the process (including the pull of `r$cookies$value`) start to finish and runs cleanly in a fresh R session? – Anthony Damico Jul 07 '16 at 23:33
  • I can't seem to get the server to accept the cookie that it generates for the R script, so I don't know whats up with that. However, I updated my code a bit to be more clear and posted an image of my output, this might give you some clues as to where yours is failing. – Ben Abraham Jul 08 '16 at 00:58
  • Worked fine for me on a Mac with that aging cookie. I already had visited the page with Chrome so might have had another cookie to be found in the cookie-jar. – IRTFM Jul 08 '16 at 02:21
  • right, thanks, i am able to reproduce the download by pulling the cookie from chrome, but that doesn't solve the problem, unfortunately. the server gives R invalid cookies while it gives web browsers valid cookies, for some reason :/ any other ideas? – Anthony Damico Jul 08 '16 at 04:52
  • I still can't get R to obtain valid cookies for the site, and even if I generate the URL from its id and file name, it wont let it download without a valid cookie. If you don't need to download fresh versions of the files every time, you can always bulk-download once with a valid cookie and then just load from disk when you need the files. Also, are you looking for just the PDFs or the statistical data files too? (would be my guess since R) – Ben Abraham Jul 08 '16 at 19:45
  • @BenAbraham thanks.. my goal is to restore this [world values survey download automation script](https://github.com/ajdamico/asdfree) that previously worked fine :( asking users to use chrome's inspection features is too much work for most people – Anthony Damico Jul 09 '16 at 01:57

Your issue is likely caused by the 302 status code. I could explain just what a 302 code is, but it looks like you could benefit from an explanation of the entire download process:

This is what happens when a user clicks on that PDF link:

  1. The onclick JavaScript event is triggered for that link. If you right-click on the link and choose "Inspect Element", you can see that its onclick handler is set to DocDownload('1316').
  2. If we type DocDownload into the JavaScript console, however, the browser tells us that DocDownload does not exist as a function.
  3. This is because the PDF link is located inside an iframe nested within the window. The dev console in a browser only accesses the variables/functions of the frame it is currently targeting (the top-level window by default), so DocDownload, which is defined inside the iframe's document, appears to be missing. (See the R sketch after this list for pulling the document ids out of that frame.)
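For what it's worth, if the documentation frame page (assumed here to be AJDocumentationSmpl.jsp, which appears as the Referer in the other answers) returns static HTML, the DocDownload ids could in principle be scraped from the onclick attributes in R, something like this sketch:

library(httr)
library(rvest)

# sketch: pull the DocDownload('<id>') ids out of the onclick attributes;
# the frame URL is an assumption and may need query parameters in practice
frame_page <- GET("http://www.worldvaluessurvey.org/AJDocumentationSmpl.jsp")
anchors    <- html_nodes(read_html(content(frame_page, "text")), "a[onclick*='DocDownload']")
doids      <- gsub("\\D", "", html_attr(anchors, "onclick"))
doids      # these are the DOID values the other answers POST to AJDownload.jsp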
johnsimer