The BCOGC keeps a database of applications for drilling wells in northeast British Columbia. By default, filters are active that show only applications approved within the last month, even though the application database holds 30K+ records:

[Screenshot: application list with the default filters active]

When the filter is deactivated:

[Screenshot: application list with the filter deactivated, showing all records]

To download the entire data set, remove or deactivate any filters, then click Actions > Download > CSV. I want to download the entire data set (containing 30K+ records) automatically using R.

When I use

library(tidyverse)

df <- read_csv(
  file = 'https://reports.bcogc.ca/ogc/f?p=200:21::CSV::::'
)

it only downloads whatever the default query specifies, so around 150 records, not 30K+.

How can I use R to download the entire data set automatically? Is this a task for httr or RSelenium?

  • Most likely there is javascript involved so you'd need something like `RSelenium` to execute that javascript for you. – MrFlick Nov 26 '18 at 20:26
  • This falls into the 1% of sites that actually require a heavyweight third-party dependency like Selenium or Splash (most don't, and I really wish SO contributors would stop defaulting to that suggestion). It would normally be possible to mimic the browser flow with `httr` verbs, but it calculates a hidden form variable `p_request` in-page via JS that the very limited `V8` package cannot execute properly. The back end is an Oracle application server and it uses a couple of `POST`s to transition state, which is maintained on the back end. – hrbrmstr Nov 26 '18 at 20:51
  • Also, there is a session-generated value between the `::` in `p=200:21::CSV` that gets put there after the whole workflow. So, I'd use Selenium (via `seleniumPipes`) or Splash (via `splashr`) to do this. – hrbrmstr Nov 26 '18 at 20:53
  • If you want to see the application flow, there are two different sessions in http://rud.is/dl/bcogc-proxy-sessions.burp which can be read in using this source-only package: https://github.com/hrbrmstr/burrp – hrbrmstr Nov 26 '18 at 20:55
  • @hrbrmstr: Thanks for this info. Yes, I removed the session-generated value from the URL used in my `read_csv(file = '...')` since it seemed to retrieve information sufficiently, albeit not everything. I'm completely unfamiliar with the process of instructing a browser to automate the different components (removing the filter and then downloading the content)... Any assistance would be appreciated. – Werner Nov 26 '18 at 20:58
  • What platform are you on? Docker tends to be the best way to get Selenium or Splash to run but Windows folks often have issues with it so I want to be able to show what will work best for you. – hrbrmstr Nov 26 '18 at 21:03
  • @hrbrmstr: I am running on Windows. :-| – Werner Nov 26 '18 at 21:04

1 Answer

OK, I'm going to go with Selenium then, since it doesn't necessarily require Docker (though the example I'm using does use Docker :-). I'm pretty sure I could get Splash/splashr to do this as well, but it involves a file download and I think there are issues with that on the Splash back end. As the splashr author, I also avoid having to deal with GitHub issues if I use Selenium for this example ;-)

Anyway, you should install RSelenium. I can't really provide support for that, but it's well documented and the rOpenSci folks are super helpful. I'd highly suggest getting Docker to run on your system or getting your department to set up a Selenium server you all can use.
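
If the packages aren't installed yet, this is the minimal setup sketch (magick and openssl are only needed for the screenshot checks used below):

# RSelenium drives the browser; magick + openssl are only used below to view screenshots
install.packages(c("RSelenium", "magick", "openssl"))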

There are a couple gotchas for this use-case:

  • Some element names we need to instrument are dynamically generated so we have to work around that
  • This involves downloading a CSV file so we need to map a filesystem path in Docker so it downloads properly
  • This is a super slow site, so you need to figure out wait times after each interaction (I'm not going to do that since you may be on a slower or faster network, and network speed does play a part here, though not that much)

I'd suggest working through the vignettes for RSelenium before trying the below to get a feel for how it works. You're essentially coding up human page interactions.

You will need to start Docker with a mapped directory. See "download file with Rselenium & docker toolbox" for all the info, but here's how I did it on my macOS box:

docker run -d -v /Users/hrbrmstr/Downloads:/home/seluser/Downloads -p 4445:4444 selenium/standalone-firefox:2.53.1

That makes Selenium accessible on port 4445, uses Firefox (b/c Chrome is evil) and maps my local downloads directory to the Firefox default dir for the selenium user in the Docker container. That means well_authorizations_issued.csv is going to go there (eventually).

Now, we need to crank up R and connect it to this Selenium instance. We need to create a custom Firefox profile since we're saving stuff to disk and we don't want the browser to prompt us for anything:

library(RSelenium)

makeFirefoxProfile(
  list(
    browser.download.dir = "/home/seluser/Downloads",
    browser.download.folderList = 2L,
    browser.download.manager.showWhenStarting = FALSE, 
    browser.helperApps.neverAsk.saveToDisk =  "text/csv"
  )
) -> ffox_prof

remoteDriver(
  browserName = "firefox", port = 4445L,
  extraCapabilities = ffox_prof
) -> remDr

invisible(remDr$open())
remDr$navigate("https://reports.bcogc.ca/ogc/f?p=AMS_REPORTS:WA_ISSUED")
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))

You will need to uncomment the Sys.sleep() calls and experiment with various "wait time" values between calls. Some will be short (1-2s), others will be larger (20s, 30s, or higher).

I'm not displaying the output of the screenshots here, but those are one way to figure out timings (i.e. keep generating screenshots after an element interaction until the gray spinner boxes are gone, etc., and keep a mental note of how many seconds that took).
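
If you'd rather not hand-tune every Sys.sleep(), one alternative is a small polling helper along these lines. This is only a sketch of mine (the helper and its timings are not part of RSelenium): it keeps retrying findElement() until the element appears or a timeout is hit.

# hypothetical helper: poll for an element instead of using a fixed Sys.sleep()
wait_for_element <- function(remDr, using, value, timeout = 60, poll = 2) {
  deadline <- Sys.time() + timeout
  repeat {
    el <- tryCatch(
      remDr$findElement(using = using, value = value),
      error = function(e) NULL  # findElement() errors while the element isn't there yet
    )
    if (!is.null(el)) return(el)
    if (Sys.time() > deadline) stop("Timed out waiting for: ", value)
    Sys.sleep(poll)
  }
}

# e.g. box <- wait_for_element(remDr, "xpath", "//li[contains(., 'Approval Date is in the last')]/span")

It won't help with the final CSV download wait, but it covers the "wait until the element shows up" cases.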

Now, the one tricky bit noted above is figuring out where the checkbox is to turn off the filter, since it has a dynamic id. However, we aren't actually going to click on the checkbox b/c the folks who created that app trapped the click event on the span element that surrounds it, so we have to find the li element that contains the checkbox label text and then go to the span element and click on it.

box <- remDr$findElement(using = "xpath", value = "//li[contains(., 'Approval Date is in the last')]/span")
box$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))

^^ definitely needs a delay (you likely saw it spin for a while in person when clicking yourself, so you can count that and add in some buffer seconds).

Then, we click on the drop-down "menu" (it's really a button):

btn1 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_button")
btn1$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))

Then the download "menu" item (it's really a button):

btn2 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_menu_14i")
btn2$clickElement()
# Sys.sleep(###)
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))

^^ also really needs a delay, as the Download "dialog" takes a few seconds to come up (it did for me, at least).

Now, find the CSV box which is really an a tag:

lnk <- remDr$findElement(using = "css", "a#WA_ISSUED_download_CSV")
lnk$clickElement()
### WAIT A WHILE
magick::image_read(openssl::base64_decode(remDr$screenshot()[[1]]))

That last bit is something you'll have to experiment with. It takes a while to process the request and then transfer the ~9MB file. The call to remDr$screenshot() actually waits for the download to complete, so you can remove the display and decoding code, assign the output to a variable, and use that as an automatic "wait"-er.
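
Once the file lands in the mapped directory you can pull it into R. A sketch, assuming the download keeps the well_authorizations_issued.csv name and that ~/Downloads is the host directory you mapped in the docker run call above:

# adjust the path to whatever host directory you mapped into the container
df <- readr::read_csv("~/Downloads/well_authorizations_issued.csv")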

I tried this 3x on 2 different macOS systems and it worked fine. YMMV.

I'm guessing you'll want to automate this eventually, so you could have a system() call towards the top of the script that starts the Selenium Docker container, then do the rest of the bits, and then issue another system() call to shut down the Docker container.
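
A rough sketch of that wrapper, assuming the docker CLI is on your PATH (the container name here is arbitrary):

# start the Selenium container before any of the RSelenium code runs
system("docker run -d --name bcogc_selenium -v /Users/hrbrmstr/Downloads:/home/seluser/Downloads -p 4445:4444 selenium/standalone-firefox:2.53.1")

# ... all of the RSelenium interaction shown above ...

# then tear it down once the CSV has landed
system("docker stop bcogc_selenium")
system("docker rm bcogc_selenium")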

Alternately, https://github.com/richfitz/stevedore is now on CRAN; it's a pure R interface for starting/stopping Docker containers (amongst many other things), so you could use that instead of system() calls.

If you can't use Docker, you need to install a "webdriver" executable for Firefox on your Windows box and also get the Selenium Java archive, ensure you have Java installed and then do the various manual incantations to get that going (which is beyond the scope of this answer).
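
One possible shortcut on that non-Docker route (untested here, so treat it as an assumption) is RSelenium's rsDriver() helper, which tries to download and manage the driver binaries for you:

library(RSelenium)

# rough sketch only: rsDriver() starts a local Selenium server + browser and
# returns both handles; you'd still need a Firefox profile that points
# browser.download.dir at a local Windows folder instead of /home/seluser/Downloads
driver <- rsDriver(browser = "firefox", port = 4445L)
remDr <- driver$client   # already open; use in place of remoteDriver()/open() above

# ... same navigation / click sequence as before ...

remDr$close()
driver$server$stop()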

Here's a shortened, contiguous version of the above:

library(RSelenium)

# start Selenium before doing this

makeFirefoxProfile(
  list(
    browser.download.dir = "/home/seluser/Downloads",
    browser.download.folderList = 2L,
    browser.download.manager.showWhenStarting = FALSE, 
    browser.helperApps.neverAsk.saveToDisk =  "text/csv"
  )
) -> ffox_prof

remoteDriver(
  browserName = "firefox", port = 4445L,
  extraCapabilities = ffox_prof
) -> remDr

invisible(remDr$open())
remDr$navigate("https://reports.bcogc.ca/ogc/f?p=AMS_REPORTS:WA_ISSUED")
# Sys.sleep(###)

box <- remDr$findElement(using = "xpath", value = "//li[contains(., 'Approval Date is in the last')]/span")
box$clickElement()
# Sys.sleep(###)

btn1 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_button")
btn1$clickElement()
# Sys.sleep(###)

btn2 <- remDr$findElement(using = "css", "button#WA_ISSUED_actions_menu_14i")
btn2$clickElement()
# Sys.sleep(###)

lnk <- remDr$findElement(using = "css", "a#WA_ISSUED_download_CSV")
lnk$clickElement()
### WAIT A WHILE
done <- remDr$screenshot()

# stop Selenium 