You need to roll up your sleeves and apply some elbow grease.
Before I expand on an answer, I did check the site for scraping policies, including the robots.txt
file. That file contains:
User-agent: *
Disallow:
An empty Disallow: directive is actually valid and permits crawling of everything. I suspect the site owner meant to put a / after Disallow: (which would block all paths), but as written nothing says we can't scrape.
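If you'd rather check that programmatically, the robotstxt package (an extra dependency, not used in the rest of this answer) can evaluate a URL against the site's rules:

# optional sanity check with the robotstxt package (install it first);
# should return TRUE given the empty Disallow: directive above
robotstxt::paths_allowed("https://www.doogal.co.uk/ElectoralConstituencies.php")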
Some libraries we'll need:
library(rvest)
library(httr)
library(tidyverse)
I'm using different libraries than you are. If you'd prefer to stick with the XML
and RCurl
packages and base R, then you'll need to adapt this or wait for another answer.
The following gets the initial page data that we'll mine for links:
res <- httr::GET("https://www.doogal.co.uk/ElectoralConstituencies.php")
pg  <- httr::content(res, as = "parsed") # parse the HTML body into an xml_document
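It's worth failing fast if that request didn't succeed. httr::stop_for_status() (used again in the helper function below) raises an R error on any 4xx/5xx response:

# abort early if the page request came back with an HTTP error
httr::stop_for_status(res)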
In the event you want the CSV of the main page data, here's how to extract the URL for it:
# find the "CSV" download link and turn its relative href into an absolute URL
html_nodes(pg, "li > a[href*='CSV']") %>%
  html_attr("href") %>%
  sprintf("https://www.doogal.co.uk/%s", .) -> main_csv_url
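If you do want that file, read.csv() will read straight from the URL. A minimal sketch, assuming the selector matched exactly one link (take main_csv_url[1] first if it matched several):

# pull the site-wide CSV directly from the extracted URL
main_df <- read.csv(main_csv_url, stringsAsFactors = FALSE)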
Now we need links to the individual constituencies. I inspected the HTML page content with Chrome Developer Tools to figure out the CSS selectors:
constituency_nodes <- html_nodes(pg, "td > a[href*='constituency=']")
constituency_names <- html_text(constituency_nodes)
# strip everything up to and including the final "=" so only the id remains
constituency_ids   <- gsub("^E.*=", "", html_attr(constituency_nodes, "href"))
Note that we're only saving off the ids and not the full URLs.
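A quick sanity check never hurts here; pairing the names with the ids in a tibble makes it easy to eyeball whether the gsub() did what we wanted:

# pair names with ids for a quick visual inspection
tibble(name = constituency_names, id = constituency_ids)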
We'll make a helper function to shorten things up. httr::GET()
lets us behave like a browser would:
get_constituency <- function(id) {
  httr::GET(
    url = "https://www.doogal.co.uk/ElectoralConstituenciesCSV.php",
    query = list(constituency = id) # httr appends ?constituency=<id> for us
  ) -> res
  httr::stop_for_status(res) # abort on a 4xx/5xx response
  # ask for the raw text of the body so read.csv() can parse it itself
  res <- read.csv(text = httr::content(res, as = "text", encoding = "UTF-8"),
                  stringsAsFactors = FALSE)
  as_tibble(res)
}
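Before looping over everything, try a single call. This assumes the first extracted id is valid; you should get back a tibble of postcodes for one constituency:

# smoke-test the helper on one constituency before the full run
get_constituency(constituency_ids[1])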
And then we call our new function for all the constituencies. The following includes a progress bar for free:
pb <- progress_estimated(3) # 3 == the number of test iterations below
map_df(constituency_ids[1:3], ~{
  pb$tick()$print() # advance the progress bar
  Sys.sleep(5)      # be polite: pause between requests
  get_constituency(.x)
}) -> postcodes_df
glimpse(postcodes_df)
## Observations: 10,860
## Variables: 12
## $ Postcode <chr> "CF33 6PS", "CF33 6PT", "CF33 6PU", "CF33 6RA", "CF33 6R...
## $ In.Use. <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "...
## $ Latitude <dbl> 51.53863, 51.54013, 51.53815, 51.54479, 51.55091, 51.552...
## $ Longitude <dbl> -3.700061, -3.699713, -3.690541, -3.684888, -3.673475, -...
## $ Easting <int> 282191, 282219, 282850, 283259, 284066, 284886, 284613, ...
## $ Northing <int> 183562, 183728, 183493, 184222, 184885, 185007, 183874, ...
## $ Grid.Ref <chr> "SS821835", "SS822837", "SS828834", "SS832842", "SS84084...
## $ Introduced <chr> "7/1/1995 12:00:00 AM", "1/1/1980 12:00:00 AM", "1/1/198...
## $ Terminated <chr> "", "", "", "", "", "", "4/1/2002 12:00:00 AM", "", "1/1...
## $ Altitude <int> 45, 47, 46, 76, 76, 131, 61, 27, 9, 7, 8, 8, 7, 8, 8, 8,...
## $ Population <int> 34, NA, 7, 33, 48, 11, NA, 10, NA, NA, NA, NA, NA, NA, N...
## $ Households <int> 12, NA, 3, 11, 19, 4, NA, 4, NA, NA, NA, NA, NA, NA, NA,...
NOTE:
- I only did 3 iterations since I don't need this data. Remove that limit for your own needs (see the sketch after this list).
- Please keep the delay code in there when you do this. You have time, and it's their bandwidth & CPU you'd be abusing.
- There's enough data in the code above to add the constituency name as a data frame field, but that's left as an exercise for the reader; the sketch below shows one way.
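For completeness, here's one way to do the full run while carrying the name along, sketched under the assumption that constituency_names and constituency_ids line up one-to-one (they should, since both came from the same set of nodes):

pb <- progress_estimated(length(constituency_ids))
map2_df(constituency_names, constituency_ids, ~{
  pb$tick()$print()
  Sys.sleep(5) # keep the polite delay
  # .y is the id we fetch with; .x is the matching constituency name
  dplyr::mutate(get_constituency(.y), Constituency = .x)
}) -> postcodes_df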