
I'm trying to automatically download documents for Oil & Gas wells from the Colorado Oil and Gas Conservation Commission (COGCC) using the "rvest" and "downloader" packages in R.

The link to the table/form that contains the documents for a particular well is: http://ogccweblink.state.co.us/results.aspx?id=12337064

The "id=12337064" is the unique identifier for the well

The documents on the form page can be downloaded by clicking them. An example is below. http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=3172781

The "DocumentID=3172781" is the unique document ID for the document to be downloaded. In this case, an xlsm file. Other file formats on the document page include PDF and xls.

So far I've been able to write code to download any document for any well, but it is limited to the first page. The majority of wells have documents spread across multiple pages, and I'm unable to download documents on pages other than page 1 (all document pages share the same URL).

## Extract the document ID for the document to be downloaded, in this case "DIRECTIONAL DATA". I used the SelectorGadget tool to extract the CSS path
library(rvest)
html <- html("http://ogccweblink.state.co.us/results.aspx?id=12337064")
File <- html_nodes(html, "tr:nth-child(24) td:nth-child(4) a")
File <- as(File[[1]],'character')
DocId<-gsub('[^0-9]','',File)
DocId
[1] "3172781"

## To download the document, I use the downloader package
library(downloader)
linkDocId <- paste('http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=', DocId, sep = '')
download(linkDocId, "DIRECTIONAL DATA", mode = 'wb')

trying URL 'http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=3172781'
Content type 'application/octet-stream' length 33800 bytes (33 KB)
downloaded 33 KB
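
For completeness, here's a rough sketch of how the same idea could grab every document on page 1 in one go (the attribute selector is an assumption on my part, not something I've verified against the site's markup):

## Sketch: pull every DocumentId on page 1, assuming all download links
## contain "DownloadDocument.aspx" in their href
library(rvest)
html <- html("http://ogccweblink.state.co.us/results.aspx?id=12337064")
links <- html_nodes(html, "a[href*='DownloadDocument.aspx']")
DocIds <- gsub('[^0-9]', '', html_attr(links, 'href'))
DocIds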

Does anyone know how I can modify my code to download documents on other pages?

Many thanks!

Em

user2566907
  • When you load that page there is a post request containing something like `__EVENTARGUMENT=Page%242`. This parameter seems to govern the data you see. – CL. Aug 21 '15 at 08:33
  • There are clues with RCurl and httr [here](http://stackoverflow.com/questions/5797688/post-request-using-rcurl) (use the dev tools in Firefox or Chrome to see the requests your browser sends and mimic them later) – Tensibai Aug 21 '15 at 10:15
  • Thanks for the suggestion @user2706569. I changed the parameter to `__EVENTARGUMENT=Page$2` and re-ran the code found at [link](http://stackoverflow.com/questions/15853204/how-to-login-and-then-download-a-file-from-aspx-web-pages-with-r) to view the documents on page 2, but the POST request still returns the documents from page one. The only adjustment made to that code was including `eventargument <- as.character("Page$2"); params <- list('__EVENTARGUMENT' = eventargument); html = postForm('http://ogccweblink.state.co.us/results.aspx?id=12337064', .params = params, curl = curl)` – user2566907 Aug 21 '15 at 18:24

1 Answer


You have to use the very same cookie for the second query and pass the viewstate and validation fields as well. Quick example:

  1. Load RCurl, fetch the URL and preserve the cookie:

    url   <- 'http://ogccweblink.state.co.us/results.aspx?id=12337064'
    library(RCurl)
    curl  <- curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = getCurlHandle())
    page1 <- getURL(url, curl = curl)
    
  2. Extract the VIEWSTATE and EVENTVALIDATION values after parsing the HTML:

    page1 <- htmlTreeParse(page1, useInternal = TRUE)
    viewstate  <- xpathSApply(page1, '//input[@name = "__VIEWSTATE"]', xmlGetAttr, 'value')
    validation <- xpathSApply(page1, '//input[@name = "__EVENTVALIDATION"]', xmlGetAttr, 'value')
    
  3. Query the same URL again with the saved cookie and the extracted hidden INPUT values, asking for the second page:

    page2 <- postForm(url, curl = curl,
             .params = list(
                 '__EVENTARGUMENT'   = 'Page$2',
                 '__EVENTTARGET'     = 'WQResultGridView',
                 '__VIEWSTATE'       = viewstate,
                 '__EVENTVALIDATION' = validation))
    
  4. Extract the URLs from the table shown on the second page:

    page2 <- htmlTreeParse(page2, useInternal = TRUE)
    xpathSApply(page2, '//td/font/a', xmlGetAttr, 'href')
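
  5. Feed the extracted links back into the download step from the question. A rough sketch, assuming every href carries a `DocumentId=` query string exactly like the example in the question:

    hrefs  <- xpathSApply(page2, '//td/font/a', xmlGetAttr, 'href')
    docids <- gsub('[^0-9]', '', hrefs)
    library(downloader)
    for (id in docids) {
        # destination names below are placeholders; the real files are PDF/xls/xlsm
        link <- paste0('http://ogccweblink.state.co.us/DownloadDocument.aspx?DocumentId=', id)
        download(link, destfile = paste0('document_', id), mode = 'wb')
    }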
    
daroczig
  • Thanks for the prompt response @daroczig. I used your code modification above and couldn't get the document IDs printed for the 2nd page from the `postForm` function. The document IDs need to be extracted before downloading the documents from any page. Looking forward to your response. Thanks! – user2566907 Aug 21 '15 at 21:23
  • @user2566907 -- `postForm` returns HTML as text, which includes the IDs and can be parsed with `XML` (as I did in the 2nd step above) or with `rvest` (as in your original example). Just save what `postForm` returns and pass that to the `html` function you used in your question -- it's working fine. – daroczig Aug 21 '15 at 21:48
  • I did what you suggested and got the following error message: `Error in UseMethod("html_nodes") : no applicable method for 'html_nodes' applied to an object of class "character"` for the html function, and `Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "character"` for the XML one. Could you please show me the code that works for you? Thanks! – user2566907 Aug 22 '15 at 19:20
  • @user2566907 I've edited my answer to include an example for that part as well, but I think SO is for asking for general guidelines about a specific question and not a site offering full-blown solutions for a given problem. So you should rather ask separate questions next time (e.g. how to parse the HTML, how to load the second page etc). – daroczig Aug 22 '15 at 19:34