
Using R (with the packages rvest, jsonlite and httr), I am trying to programmatically download all the data files available at the following URL:

http://environment.data.gov.uk/ds/survey/index.jsp#/survey?grid=TQ38

I have tried using Chrome's "Inspect" tool and then viewing the source of the download options, but the page appears to use ng-table and AngularJS to retrieve the final URL for downloading each dataset. The index.jsp file seems to reference a JavaScript file, downloads/ea.downloads.js, which looks valuable, but I am unsure how to find it or how to work out which functions I would need to call.

Ideally the first result would be a data.frame or data.table with one column for the Product and one column for the URL of the file to be downloaded. I could then loop through the rows of that table and download each zip file.
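
For example, something along these lines (the product names and URLs below are made-up placeholders, just to show the shape I am after):

wanted <- data.frame(
  product = c("LIDAR-DSM-1M-2003", "LIDAR-DTM-2M-2015"),                     # placeholder product names
  url     = c("http://.../download/guid-1", "http://.../download/guid-2"),   # placeholder URLs
  stringsAsFactors = FALSE
)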

I think this AngularJS issue is similar to this question:

web scrape with rvest

but I cannot work out how my code should be adjusted for this example.

h.l.m
  • The `splashr` package by @hrbrmstr might be useful. Here is a vignette: https://rud.is/b/2017/02/09/diving-into-dynamic-website-content-with-splashr/ – chinsoon12 Feb 14 '17 at 00:52

2 Answers


I am sure there is a better solution; this is not final, but it is a start. It appears the data you are looking for is stored in a JSON file associated with the main page. Once that file is downloaded, you can process it to determine the files you want to download.

library(httr)
library(jsonlite)
library(magrittr)  # provides the %>% pipe used below

# base URL for the JSON file (found by examining files downloaded by page load);
# the grid square (TQ38) matches the one in the question's URL
curl <- 'http://www.geostore.com/environment-agency/rest/product/OS_GB_10KM/TQ38?catalogName=Survey'
datafile <- GET(curl)

# process the response and flatten the JSON to a data frame
output <- content(datafile, as = "text") %>% fromJSON(flatten = FALSE)
# examine this data frame to identify the desired files

# baseurl was determined by manually downloading one file
baseurl <- "http://www.geostore.com/environment-agency/rest/product/download/"
# example of downloading a single file given the base URL and guid;
# row 49 was picked at random to test the download
download.file(paste0(baseurl, output$guid[49]), output$fileName[49], method = "auto")

The naming scheme from the site is confusing; I will leave it to the experts to determine the meaning.
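
To get closer to the table asked for in the question, here is a minimal, untested sketch that builds a product/URL data frame from output and loops over its rows (the descriptiveName, fileName and guid columns are visible in the glimpse output in the other answer):

# one row per product, with the file name and its download URL
files_df <- data.frame(
  product = output$descriptiveName,
  file    = output$fileName,
  url     = paste0(baseurl, output$guid),
  stringsAsFactors = FALSE
)

# loop through the rows and download each zip file
for (i in seq_len(nrow(files_df))) {
  download.file(files_df$url[i], files_df$file[i], mode = "wb")
}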

Dave2e
  • Thank you @Dave2e! That looks really good. For future learning, can I get some further information on what you specifically did to get the base URL "by examining files downloaded by page load" (i.e. how did you find that), and then how did you know to create the baseurl of "http://www.geostore.com/environment-agency/rest/product/download/"? – h.l.m Feb 14 '17 at 09:36
  • See the link above from chinsoon; it is a good reference and the author is very active here at Stack Overflow. Look for his posts/answers. As far as the base URL goes, I just clicked on a link to manually download the file and my browser's download history stored the URL of the file. – Dave2e Feb 14 '17 at 11:36
  • Nice work @Dave2e! You can simplify the first bit by just using `jsonlite::fromJSON()` (no need for the `GET`/`content`). I'd suggest using `purrr::walk2` + `download.file` to sequentially get the files. I'd normally just do `download.file(URLS, FILES, method="libcurl")` but these are giant, slow files, there are 88 of them, and `libcurl` is going to be resource greedy. Awesome web page session spelunking all the way 'round tho! – hrbrmstr Feb 14 '17 at 11:59
  • @h.l.m you need to open up "Developer Tools" and then refresh the page. Look on the Network "tab" in there and filter on XHR requests then poke around a bit. It's def possible to make a generic solution for any map sector using `splashr` or `seleniumPipes`, too. – hrbrmstr Feb 14 '17 at 12:01
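
The simplification suggested in the comments would look roughly like this (a sketch using the same geostore URL as in the answer above):

library(jsonlite)

# fromJSON() can read directly from the URL, so GET()/content() are not needed
output <- fromJSON("http://www.geostore.com/environment-agency/rest/product/OS_GB_10KM/TQ38?catalogName=Survey")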

A slight expansion on Dave2e's solution demonstrating how to get the XHR JSON resource with splashr:

library(splashr) # devtools::install_github("hrbrmstr/splashr")
library(tidyverse)

splashr requires a Splash server, and the package provides a way to start one with Docker. Read the help on the GitHub page and inside the package to find out how to use that.
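
For reference, the Docker side is roughly a one-off image pull (the scrapinghub/splash image name comes from the Splash documentation; start_splash() then runs the container for you):

# run once in a shell, or from R as shown (wrapped in system() for completeness)
system("docker pull scrapinghub/splash")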

vm <- start_splash() 

URL <- "http://environment.data.gov.uk/ds/survey/index.jsp#/survey?grid=TQ38"

This retrieves all the resources loaded by the page:

splash_local %>% render_har(URL) -> resources # get ALL the items the page loads

stop_splash(vm) # we don't need the splash server anymore

This targets the background XHR resource with catalogName in it. You'd still need to hunt around to find this initially, but once you know the pattern, it becomes a generic operation for other grid squares (see the sketch after the output below).

map_chr(resources$log$entries, c("request", "url")) %>% 
  grep("catalogName", ., value=TRUE) -> files_json

files_json
## [1] "http://www.geostore.com/environment-agency/rest/product/OS_GB_10KM/TQ38?catalogName=Survey"
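
Once the URL pattern is known, the rendering step can in principle be skipped for other grid squares by templating the URL directly. A hypothetical helper (the endpoint is assumed to behave the same way for other 10km tiles):

# build the catalogue URL for a given 10km grid square
survey_catalog_url <- function(grid = "TQ38") {
  sprintf("http://www.geostore.com/environment-agency/rest/product/OS_GB_10KM/%s?catalogName=Survey",
          grid)
}

survey_catalog_url("TQ28")
## [1] "http://www.geostore.com/environment-agency/rest/product/OS_GB_10KM/TQ28?catalogName=Survey"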

Read that in:

guids <- jsonlite::fromJSON(files_json)

glimpse(guids)
## Observations: 98
## Variables: 12
## $ id              <int> 170653, 170659, 170560, 170565, 178307, 178189, 201556, 238...
## $ guid            <chr> "54595a8c-b267-11e6-93d3-9457a5578ca0", "63176082-b267-11e6...
## $ pyramid         <chr> "LIDAR-DSM-1M-ENGLAND-2003-EA", "LIDAR-DSM-1M-ENGLAND-2003-...
## $ tileReference   <chr> "TQ38", "TQ38", "TQ38", "TQ38", "TQ38", "TQ38", "TQ38", "TQ...
## $ fileName        <chr> "LIDAR-DSM-1M-2003-TQ3580.zip", "LIDAR-DSM-1M-2003-TQ3585.z...
## $ coverageLayer   <chr> "LIDAR-DSM-1M-ENGLAND-2003-EA-MD-YY", "LIDAR-DSM-1M-ENGLAND...
## $ fileSize        <int> 76177943, 52109669, 59326278, 18048623, 13204420, 11919071,...
## $ descriptiveName <chr> "LIDAR Tiles DSM at 1m spatial resolution 2003", "LIDAR Til...
## $ description     <chr> "1m", "1m", "1m", "1m", "1m", "1m", "1m", "1m", "1m", "1m",...
## $ groupName       <chr> "LIDAR-DSM-TIMESTAMPED-ENGLAND-2003-EA", "LIDAR-DSM-TIMESTA...
## $ displayOrder    <int> -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,...
## $ metaDataUrl     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "https://data.g...

The rest is similar to the other answer:

dl_base <- "http://www.geostore.com/environment-agency/rest/product/download"
urls <- sprintf("%s/%s", dl_base, guids$guid)

Be kind to your network and their server:

walk2(urls, guids$fileName, download.file)

Do this only if you think your system and their server can handle 98 simultaneous 70-100 MB file downloads:

download.file(urls, guids$fileName, method = "libcurl")  # libcurl is required for a vector of URLs
hrbrmstr