
I want to be able to take data from numerous tables hosted on a website. The catch is that they are all on different webpages.

As an example, here are links to the electoral constituencies of the UK. As you can see, all of the constituencies are there and each links through to a separate page. If you go to an individual constituency page, you have the option of downloading a .csv file of the postcodes, or viewing the data on an HTML page.

I have found numerous explanations of how to do this when the various data sources are all on the same page, but is it possible to create a single data file that combines the postcode data from each area?

For example, I have got the data for the first area, Aberavon, using the following code, which I adapted from the answer to this question.

library(XML)
library(RCurl)
# install.packages("rlist")  # run once if rlist isn't already installed
library(rlist)

theurl <- getURL("https://www.doogal.co.uk/ElectoralConstituencies.php?constituency=W07000049",
                 .opts = list(ssl.verifypeer = FALSE))
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))  # number of rows in each table
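
From there, I pull out the postcode table by taking the element with the most rows, although I'm only assuming the postcode listing is the largest table on the page:

postcodes <- tables[[which.max(n.rows)]]  # assumes the postcode listing is the biggest table
head(postcodes)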

I usually use R, so it would be good to know how to do this in R, but I appreciate some other approach might be better suited and am happy to try others. I am quite new to data scraping, so if this is really obvious I might not have understood the limitations of the instructions I have read!

1 Answer


You need to roll up your sleeves and apply some elbow grease.

Before I expand on an answer, I did check the site for scraping policies, including the robots.txt file. Said file is malformed:

User-agent: *
Disallow:

I suspect the owner of the site meant to have a / after Disallow: but nothing says we can't scrape.
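
If you'd rather check this programmatically, the robotstxt package can do it (just a sketch; the package isn't used anywhere else in this answer):

# install.packages("robotstxt")  # run once if needed
robotstxt::paths_allowed("https://www.doogal.co.uk/ElectoralConstituencies.php")
# should return TRUE, since the empty Disallow permits everything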

Some libraries we'll need:

library(rvest)
library(httr)
library(tidyverse)

I'm using different libraries than you are. If you'd prefer to stick with the XML and RCurl packages and base R, then you'll need to adapt this or wait for another answer.

The following gets the initial page data that we'll mine for links:

res <- httr::GET("https://www.doogal.co.uk/ElectoralConstituencies.php")

pg <- httr::content(res, as="parsed")
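
It's worth failing fast here before trying to mine anything out of the response:

httr::stop_for_status(res)  # errors out if the request came back 4xx/5xx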

In the event you wanted the CSV of the main page data, here's the URL for it:

html_nodes(pg, "li > a[href*='CSV']") %>% 
  html_attr("href") %>% 
  sprintf("https://www.doogal.co.uk/%s", .) -> main_csv_url

Now we need links to the individual constituencies. I inspected the HTML page content with Chrome Developer Tools to figure out the CSS selectors:

constituency_nodes <- html_nodes(pg, "td > a[href*='constituency=']")  # links in the constituency table
constituency_names <- html_text(constituency_nodes)
constituency_ids <- gsub("^E.*=", "", html_attr(constituency_nodes, "href"))  # strip everything up to "constituency="

Note that we're only saving the IDs and not the full URLs.
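
A quick sanity check that the names and ids line up (rebuilding the page URL here is purely for eyeballing):

head(data.frame(
  name = constituency_names,
  id   = constituency_ids,
  url  = sprintf("https://www.doogal.co.uk/ElectoralConstituencies.php?constituency=%s", constituency_ids),
  stringsAsFactors = FALSE
))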

We'll make a helper function to shorten things up; httr::GET() lets us behave the way a browser would:

get_constituency <- function(id) {

  # request the CSV download for a single constituency id
  httr::GET(
    url = "https://www.doogal.co.uk/ElectoralConstituenciesCSV.php", 
    query = list(constituency = id)
  ) -> res

  httr::stop_for_status(res)

  # read the response body as text and parse it as CSV
  res <- read.csv(text = httr::content(res, as = "text"), stringsAsFactors = FALSE)
  as_tibble(res)

}
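
Try it on a single id first to make sure it behaves (using the Aberavon id from your question):

aberavon <- get_constituency("W07000049")
glimpse(aberavon)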

Then we call our new function for all the constituencies. The following includes a progress bar for free:

pb <- progress_estimated(3)
map_df(constituency_ids[1:3], ~{

  pb$tick()$print()

  Sys.sleep(5)

  get_constituency(.x)

}) -> postcodes_df

glimpse(postcodes_df)
## Observations: 10,860
## Variables: 12
## $ Postcode   <chr> "CF33 6PS", "CF33 6PT", "CF33 6PU", "CF33 6RA", "CF33 6R...
## $ In.Use.    <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "...
## $ Latitude   <dbl> 51.53863, 51.54013, 51.53815, 51.54479, 51.55091, 51.552...
## $ Longitude  <dbl> -3.700061, -3.699713, -3.690541, -3.684888, -3.673475, -...
## $ Easting    <int> 282191, 282219, 282850, 283259, 284066, 284886, 284613, ...
## $ Northing   <int> 183562, 183728, 183493, 184222, 184885, 185007, 183874, ...
## $ Grid.Ref   <chr> "SS821835", "SS822837", "SS828834", "SS832842", "SS84084...
## $ Introduced <chr> "7/1/1995 12:00:00 AM", "1/1/1980 12:00:00 AM", "1/1/198...
## $ Terminated <chr> "", "", "", "", "", "", "4/1/2002 12:00:00 AM", "", "1/1...
## $ Altitude   <int> 45, 47, 46, 76, 76, 131, 61, 27, 9, 7, 8, 8, 7, 8, 8, 8,...
## $ Population <int> 34, NA, 7, 33, 48, 11, NA, 10, NA, NA, NA, NA, NA, NA, N...
## $ Households <int> 12, NA, 3, 11, 19, 4, NA, 4, NA, NA, NA, NA, NA, NA, NA,...

NOTE:

  • I only did 3 iterations since I don't need this data. Remove that limit for your needs.
  • Please keep the delay code in there when you do this. You have time and it's their bandwidth & CPU you'd be abusing.
  • There's enough data in the code above to add the constituency name as a data frame field, but that's left as an exercise for the reader (see the sketch just below).
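
In case it helps, here's one (untested) sketch of how to do that: iterate over the name/id pairs instead of just the ids and tack the name on with mutate():

pb <- progress_estimated(length(constituency_ids))

map2_df(constituency_names, constituency_ids, ~{

  pb$tick()$print()

  Sys.sleep(5)

  mutate(get_constituency(.y), constituency = .x)  # .x is the name, .y is the id

}) -> postcodes_df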