How to deal with missing row when binding column to data frame (a scraping issue!)

Question

I'm trying to build data frames by attaching URLs to a scraped HTML table, and then write each one to its own CSV file. The data track the passage of Bills through each stage in both the House of Commons and the House of Lords. I've written a function (see below) that reads the table, parses the HTML, scrapes the required URLs, binds the two together, extracts the rows relating to the House of Lords, and writes the CSV file. The function is then run across two lists: one of links to the Bill stage pages and one of simplified file names.

    library(XML)

    lords_tables <- function(x, y) {
      tables <- as.data.frame(readHTMLTable(x))
      sitePage <- htmlParse(x)  # Parse the page's HTML
      # href of the first <a> inside each <td>
      hrefs <- xpathSApply(sitePage, "//td/descendant::a[1]",
                           xmlGetAttr, "href")
      table_bind <- cbind(tables, hrefs)
      # Row positions of the House of Lords and Royal Assent stages
      row_no <- grep(".+: House of Lords|Royal Assent", table_bind$NULL.V2)
      lords_rows <- table_bind[row_no, ]  # Keep only those rows
      write.csv(lords_rows, file = paste0(y, ".csv"))
    }


    # x = list of links to the Bill stage pages; y = list of simplified file names
    mapply(lords_tables, x=link_list, y=gsub_URL)

This works perfectly well for the cases where debates occurred for every stage. However, some cases pose a problem, such as:

    browseURL("http://services.parliament.uk/bills/2010-12/armedforces/stages.html")

For this example, no debate occurred at the '3rd reading: House of Commons' and again at the 'Royal Assent'. This results in the following error being returned:

    Error in data.frame(..., check.names = FALSE) : 
     arguments imply differing number of rows: 21, 19
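
The mismatch can be reproduced offline. The sketch below is only an illustration: the stage names and the `/debate/...` hrefs are made-up stand-ins, not values from the real page. The XPath `//td/descendant::a[1]` returns one href per cell that actually contains a link, so stages without a link contribute nothing and the vector comes out shorter than the table.

```r
# Hypothetical three-stage table; the real one has 21 rows.
stages <- data.frame(
  stage = c("2nd reading: House of Lords",
            "3rd reading: House of Commons",
            "Royal Assent"),
  stringsAsFactors = FALSE
)

# Only two of the three stages have a link, so only two hrefs come back.
hrefs <- c("/debate/a", "/debate/b")

# cbind() on a data frame refuses to combine 3 rows with 2 values.
msg <- tryCatch(
  {
    cbind(stages, hrefs)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Padding `hrefs` with `NA` at the end would silence the error but could attach links to the wrong stages, because the vector carries no record of which rows were skipped.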

To get around this error, I'd like an NA against any stage that has no link. Does anyone know how to achieve this? I'm relatively new to R, so feel free to suggest a more elegant approach to the whole problem.

Thanks in advance!

You might find this helpful: http://stackoverflow.com/questions/12193779/how-to-write-trycatch-in-r Or maybe do an if/else for when the size of the table is 0? — Rafael, Apr 04 '17 at 20:38
The key is to grab a set of ancestor tags that each contain the two things you're interested in, and then iterate over those tags, grabbing both pieces you want, and then `rbind`ing at the end. There are a lot of examples if you search, though a lot of them use rvest instead of the XML package. — alistaire, Apr 04 '17 at 22:17
Thanks both. I was searching for answers as I'm pretty sure it's a common problem, but didn't know how to word the search term! The solution using rvest looks the more promising. — Andrew Jones, Apr 05 '17 at 13:16
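
The row-wise approach suggested in the comments can be sketched with the XML package roughly as follows. This is a minimal sketch under stated assumptions, not a tested solution for the live page: the inline HTML is a made-up stand-in for a Bill stages table, and the helper `row_to_df` and the column names `stage`, `date` and `href` are hypothetical. The idea is to loop over each `<tr>`, pull the cell text and the first link from that same row, and fall back to `NA` when the row has no link.

```r
library(XML)

# Hypothetical stand-in for a Bill stages table; the real layout may differ.
html <- '
<table>
  <tr><th>Stage</th><th>Date</th></tr>
  <tr><td><a href="/debate/1">1st reading: House of Lords</a></td><td>01.01.2011</td></tr>
  <tr><td>3rd reading: House of Commons</td><td>02.02.2011</td></tr>
  <tr><td>Royal Assent</td><td>03.03.2011</td></tr>
</table>'

doc <- htmlParse(html, asText = TRUE)

# Data rows only: the header row has <th> cells and is skipped.
rows <- getNodeSet(doc, "//tr[td]")

# Build a one-row data frame from a single <tr> node (hypothetical helper).
row_to_df <- function(row) {
  cells <- xpathSApply(row, "./td", xmlValue)
  links <- xpathSApply(row, ".//a", xmlGetAttr, "href")
  data.frame(
    stage = trimws(cells[1]),
    date  = trimws(cells[2]),
    # NA when the row contains no link, so rows and hrefs stay aligned
    href  = if (length(links) > 0) links[[1]] else NA_character_,
    stringsAsFactors = FALSE
  )
}

stages <- do.call(rbind, lapply(rows, row_to_df))

# Same filter as in the question, applied to the combined table.
lords_rows <- stages[grepl(".+: House of Lords|Royal Assent", stages$stage), ]
print(lords_rows)
```

Because each href is looked up inside its own row, a stage without a link gets `NA` instead of shifting the links that follow it. `lords_rows` could then be passed to `write.csv()` as in the original function.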
