
I'm trying to scrape county assessor data on historic property values for multiple parcels. The data are generated with JavaScript at https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=07101001, which I'm loading with phantomjs controlled by RSelenium. 'parid' in the URL is the 8-digit parcel number. I have a dataframe containing the parcel numbers I'm interested in (a few hundred in total), but I have been trying to get the code working on a small subset of them:

parcel_nums
[1] "00905101" "00905102" "00905103" "00905104" "00905105" 
[6] "00905106" "00905107" "00905108" "00905201" "00905202"

I need to scrape the data in the table generated on the page for each parcel and preserve it. I have chosen to write the page to a file "output.htm" and then parse the file afterwards. My code is as follows:

require(plyr)
require(rvest)
require(RSelenium)
require(tidyr)
require(dplyr)

parcel_nums <- prop_attr$APN[1:10]  #Vector of parcel numbers
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()

result <- remDr$phantomExecute("var page = this;
                            var fs = require(\"fs\");
                            page.onLoadFinished = function(status) {
                            var file = fs.open(\"output.htm\", \"w\");
                            file.write(page.content);
                            file.close();
                            };")

for (i in 1:length(parcel_nums)){
    url <- paste("https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=", 
        parcel_nums[i], sep = "")
    Sys.sleep(5)

    remDr$navigate(url)

    dat <- read_html("output.htm", encoding = "UTF-8") %>% 
        html_nodes("table") %>% 
        html_table(header = TRUE)
    df <- data.frame(dat)

    # tag each row with its parcel number
    df$apn <- parcel_nums[i]
    # on the first iteration initialize the final data frame, on subsequent iterations append to it
    if (i == 1) parcel_data <- df else parcel_data <- rbind(parcel_data, df)
}
remDr$close()
pJS$stop()

This will work perfectly for one or two iterations of the loop, but then it suddenly stops preserving the data generated by the JavaScript and produces an error:

 Error in `$<-.data.frame`(`*tmp*`, "apn", value = "00905105") : 
 replacement has 1 row, data has 0 

which is due to the parser not locating the table in the output file because it is not being preserved. I'm unsure if there is a problem with the implementation I've chosen or if there is some idiosyncrasy of the particular site that is causing the issue. I am not familiar with JavaScript, so the code snippet used is taken from an example I found. Thank you for any assistance.
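
In the meantime, I can at least avoid the hard crash by guarding the parse step. Something like the sketch below just skips a parcel whose table never shows up; it obviously doesn't fix the underlying rendering issue:

# Sketch: return NULL instead of erroring when output.htm has no table yet
read_parcel_table <- function(path = "output.htm") {
    nodes <- read_html(path, encoding = "UTF-8") %>% html_nodes("table")
    if (length(nodes) == 0) return(NULL)
    data.frame(html_table(nodes[[1]], header = TRUE))
}

# inside the loop:
df <- read_parcel_table()
if (is.null(df)) {
    warning(paste("No table found for parcel", parcel_nums[i], "- skipping"))
    next
}
df$apn <- parcel_nums[i]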

The answer below worked perfectly. I also moved the Sys.sleep(5) to after the $navigate call to give the page time to load the JavaScript. The loop now runs to completion.
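
Roughly, the relevant part of the loop is now (a sketch of my edit combined with the answer's getPageSource() approach, not a full listing):

remDr$navigate(url)
Sys.sleep(5)  # wait AFTER navigating so the JavaScript-generated table has time to render
doc <- htmlParse(remDr$getPageSource()[[1]])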

2 Answers

require(plyr)
require(rvest)
require(RSelenium)
require(tidyr)
require(dplyr)
require(XML)    # for htmlParse() and readHTMLTable() below

parcel_nums <- prop_attr$APN[1:10]  #Vector of parcel numbers
#pJS <- phantom()
remDr <- remoteDriver()
remDr$open()

# #result <- remDr$executeScript("var page = this;
#                                var fs = require(\"fs\");
#                                page.onLoadFinished = function(status) {
#                                var file = fs.open(\"output.htm\", \"w\");
#                                file.write(page.content);
#                                file.close();
#                                };")
#length(parcel_nums)
for (i in 1:length(parcel_nums)){
  url <- paste("https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=", 
               parcel_nums[i], sep = "")
  Sys.sleep(5)

  remDr$navigate(url)
  doc <- htmlParse(remDr$getPageSource()[[1]])
  doc_t <- readHTMLTable(doc, header = TRUE)$`NULL`
  df <- data.frame(doc_t)

  # tag each row with its parcel number
  df$apn <- parcel_nums[i]
  # on the first iteration initialize the final data frame, on subsequent iterations append to it
  if (i == 1) parcel_data <- df else parcel_data <- rbind(parcel_data, df)
}
remDr$close()

This gave me a solution, and it should work with phantomjs too. I request you to test and reply.
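
For reference, the phantomjs variant would look something like this (untested sketch; it assumes the phantomjs binary is on your PATH and uses one parcel number from your question):

require(RSelenium)
require(XML)

pJS <- phantom()                                  # start phantomjs
remDr <- remoteDriver(browserName = "phantomjs")  # attach the driver to it
remDr$open()

url <- "https://www.washoecounty.us/assessor/cama/?command=assessment_data&parid=00905101"
remDr$navigate(url)
Sys.sleep(5)                                      # let the JavaScript render the table
doc <- htmlParse(remDr$getPageSource()[[1]])
doc_t <- readHTMLTable(doc, header = TRUE)$`NULL`

remDr$close()
pJS$stop()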

Bharath
  • I ran your code, but I am receiving an error: Undefined error in RCurl call. Error in queryRD(paste0(serverURL, "/session/", sessionInfo$id, "/url"), : –  Feb 19 '16 at 07:19
  • Ignore that last comment. I added a Sys.sleep after the $open command and the error went away. –  Feb 19 '16 at 07:33

I lost an entire day trying to solve a similar issue, so I'm sharing what I learned to help others save time and nerves.

I think we need to understand that opening, navigating, and other browsing actions through the remote driver take time to complete, so we have to wait before trying to read or do anything on the pages we expect to scrape.

My problems were solved when I introduced Sys.sleep(5) after the remDr$navigate(url) call.

A neater solution seems to be inserting remDr$setTimeout(type = "page load", milliseconds = 10000), as suggested at "how to check if page finished loading in RSelenium", but I haven't tested it yet.
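
Something like this, I suppose (untested sketch; the 10-second timeout, the polling interval, and the "table" selector are placeholders to adapt to your page):

# Option 1: ask the driver to wait for page loads before returning from navigate
remDr$setTimeout(type = "page load", milliseconds = 10000)
remDr$navigate(url)

# Option 2: poll until the expected element actually appears (max ~10 s)
remDr$navigate(url)
for (attempt in 1:20) {
    if (length(remDr$findElements(using = "css selector", "table")) > 0) break
    Sys.sleep(0.5)
}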

OAA
  • Agreed. I have found that it is necessary to include a Sys.sleep after the navigate as well. The time needed may depend on the page being scraped; I was able to bring the wait down to 3 seconds per page after checking how long the page took to load with Chrome's developer tools. I will try your alternate solution next time I need to scrape. –  Apr 11 '16 at 00:25