
I am trying to get R to complete the 'Search by postcode' field on this webpage http://cti.voa.gov.uk/cti/ with predefined text (e.g. BN1 1NA), advance to the next page and scrape the resulting 4-column table, which, depending on the postcode, can span multiple pages. To make it more complex, the 'Improvement indicator' is not a text field but an image (as seen if you search with postcode BN1 3HP). I would prefer this column to contain a 0 or 1 depending on whether the image is present.

Ultimately I am after a nice data frame that mirrors the 4 columns on screen.

I have tried to modify the suggestions from this question to do what I have described above with no luck, and to be honest I am out of my depth trying to decipher this one.

I realise R may not be the best-suited tool for what I need to do, but it's all I have available. Any help would be greatly appreciated.

Chris
  • I have tried to use `look <- getHTMLFormDescription("http://cti.voa.gov.uk/cti/"); look <- look[[1]]; look(txtPostCode = "W2 4RH")` but this gives me "Error: Not Found". – dax90 Jul 11 '15 at 01:20

2 Answers


I'm not sure what the T&C of the VOA website have to say about scraping, but this code will do the job:

library("httr")
library("rvest")
post_code <- "B1 1"
resp <- POST("http://cti.voa.gov.uk/cti/InitS.asp?lcn=0",
             encode = "form",
             body = list(btnPush = 1,
                         txtPageNum = 0,
                         txtPostCode = post_code,
                         txtRedirectTo = "InitS.asp",
                         txtStartKey = 0))
resp_cont <- read_html(resp)
council_table <- resp_cont %>%
  html_node(".scl_complex table") %>%
  html_table

Firebug has an excellent 'Net' panel where the POST headers can be seen. Most modern browsers also have something similar built in.
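If you also need the 'Improvement indicator' as a 0/1 column, one rough approach is to test each table row for an <img> element rather than relying on html_table (which drops images). This is only a sketch: it assumes the indicator is rendered as an <img> tag inside the row and that the table has a single header row, so check the lengths line up before binding.

# Sketch: flag rows whose improvement-indicator cell contains an image.
# Assumes the indicator is an <img> somewhere in the row and one header row.
rows <- resp_cont %>%
  html_nodes(".scl_complex table tr")

improvement <- sapply(rows, function(r) {
  as.integer(length(html_nodes(r, "img")) > 0)
})

# Drop the header row before attaching to the scraped table
council_table$improvement <- improvement[-1]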

Nick Kennedy
  • Thanks, that works great for the first page of results. Is there any way to get it to go through all the pages and do the same? B1 1, for example, brings back 141 pages of results. – Chris Jul 12 '15 at 18:23
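To collect all the result pages with the same httr approach, one option is to keep following the 'next' link (the RSelenium answer below clicks an element with that class). This is a sketch only: it assumes the link's href is relative to http://cti.voa.gov.uk/cti/ and that a plain GET is enough to fetch the next page; the site may instead require the form fields to be re-posted or rely on session state.

# Sketch: follow the 'next' link until there isn't one, collecting each page's table.
all_pages <- list(council_table)
page <- resp_cont

repeat {
  nxt <- html_nodes(page, "a.next")
  if (length(nxt) == 0) break

  # Assumption: the href resolves against the /cti/ directory
  next_url <- paste0("http://cti.voa.gov.uk/cti/", html_attr(nxt[[1]], "href"))
  page <- read_html(GET(next_url))

  all_pages[[length(all_pages) + 1]] <- page %>%
    html_node(".scl_complex table") %>%
    html_table()
}

full_table <- do.call(rbind, all_pages)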

I used RSelenium to scrape the council tax list for an Exeter postcode:

library(RSelenium)
library(XML)  # for htmlParse, xpathSApply and readHTMLTable

input <- 'EX4 2NU'
appURL <- "http://cti.voa.gov.uk/cti/"

# Start a local Selenium server and open a browser session
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()
Sys.sleep(5)

# Fill in the postcode field and submit the search
remDr$navigate(appURL)
search.form <- remDr$findElement(using = "xpath", "//*[@id='txtPostCode']")
search.form$sendKeysToElement(list(input, key = "enter"))

# Read the first page of results
doc <- remDr$getPageSource()
tbl <- xpathSApply(htmlParse(doc[[1]]), '//tbody')
temp1 <- readHTMLTable(tbl[[1]], header = FALSE)

# Keep clicking the 'next' link while one exists, appending each page's table
v <- length(xpathSApply(htmlParse(doc[[1]]), '//a[@class="next"]'))
while (v != 0) {
    nextpage <- remDr$findElement(using = "xpath", "//*[@class = 'next']")
    nextpage$clickElement()
    doc <- remDr$getPageSource()
    tbl <- xpathSApply(htmlParse(doc[[1]]), '//tbody')
    temp2 <- readHTMLTable(tbl[[1]], header = FALSE)
    temp1 <- rbind(temp1, temp2)
    v <- length(xpathSApply(htmlParse(doc[[1]]), '//a[@class="next"]'))
}
finaltable <- temp1

Hope you find it helpful. With this one you can scrape data spread over multiple pages.
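Once the loop finishes you can inspect the combined table and close the browser session (a small housekeeping step, not strictly required):

# Check the combined results and tidy up the browser session
head(finaltable)
remDr$close()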

Stan Yip