I'm attempting to create data frames by attaching URLs to a scraped HTML table, and then writing these to individual csv files. The data are concerned with the passage of Bills through their respective stages in both the House of Commons and Lords. I've written a function (see below) which reads the tables, parses the HTML code, scrapes the URLS required, binds the two together, extracts the rows concerned with the House of Lords, and then writes the csv files. This function is then run across two lists (one of links to the Bill stage page and another of simplified file names).
library(XML)
lords_tables <- function (x, y) {
tables <- as.data.frame(readHTMLTable(x))
sitePage <- htmlParse(x) # This parses web code
hrefs <- xpathSApply(sitePage, "//td/descendant::a[1]",
xmlGetAttr, 'href') ## First href child of the a nodes
table_bind <- cbind(tables, hrefs)
row_no <- grep(".+: House of Lords|Royal Assent",
table_bind$NULL.V2) #Gives row position of Lords|Royal Assent
lords_rows <- table_bind[grep(".+: House of Lords|Royal Assent", table_bind$NULL.V2), ] # Subsets rows containing House of Lords|Royal Assent
write.csv(lords_rows, file = paste0(y, ".csv"))
}
# x = a list of links to the Bill pages/ y = list of simplified names
mapply(lords_tables, x=link_list, y=gsub_URL)
This works perfectly well for the cases where debates occurred for every stage. However, some cases pose a problem, such as:
browseURL("http://services.parliament.uk/bills/2010-12/armedforces/stages.html")
For this example, no debate occurred at the '3rd reading: House of Commons' and again at the 'Royal Assent'. This results in the following error being returned:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 21, 19
In overcoming this error I'd like to have an NA against the missing stage. Has anyone got an idea of how to achieve this? I'm a relative n00b so feel free to suggest a more elegant approach to the whole problem.
Thanks in advance!