
This question is based on another I saw closed, which sparked my curiosity because I learned something new about using Google Chrome's Inspect Element to build the HTML parsing path for XML::getNodeSet. Since that question was closed (I think because it was too broad), I'll ask a smaller, more focused question that may get at the root of the problem.

I tried to help the poster by writing the code I typically use for scraping, but I hit a wall immediately because the poster wanted elements visible in Google Chrome's Inspect Element. Those are not the same as the HTML returned by htmlTreeParse, as demonstrated here:

url <- "http://collegecost.ed.gov/scorecard/UniversityProfile.aspx?org=s&id=198969"
doc <- htmlTreeParse(url, useInternalNodes = TRUE) 
m <- capture.output(doc)
any(grepl("258.12", m))
## FALSE

But in Google Chrome's Inspect Element we can see that this information is there (highlighted in yellow):

[Screenshot of Google Chrome's Inspect Element showing the value 258.12 highlighted in yellow]

How can we get the information shown in Google Chrome's Inspect Element into R? The poster could obviously copy and paste the rendered markup into a text editor and parse it that way, but they are looking to scrape, so that workflow does not scale. Once the poster can get this information into R, they can apply typical HTML parsing techniques (XML and RCurl-fu), as sketched below.
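For context, those "typical HTML parsing techniques" look roughly like this. Here `rendered_html` is a stand-in for the page source after the JavaScript has run (which is exactly the piece we don't yet have), and the XPath is a placeholder, not the real path to the highlighted value:

library(XML)
# `rendered_html` is a hypothetical character string holding the rendered page source
doc <- htmlParse(rendered_html, asText = TRUE)
# Placeholder XPath -- in practice it would be derived from Inspect Element
nodes <- getNodeSet(doc, "//span[@id = 'some-id']")
sapply(nodes, xmlValue)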

Tyler Rinker
  • The problem is that the information is not in the source of the document. It is modified after load by JavaScript. So to get the value, you need to let the JavaScript run. Rather than simple scraping, you could use the [RSelenium](http://cran.r-project.org/web/packages/RSelenium/) package, which allows you to interact with something more browser-like that can process the JavaScript. Either that, or you could look at where the JS is getting its data from and access that resource directly. – MrFlick Aug 04 '14 at 13:57
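To illustrate MrFlick's second suggestion: if Chrome's Network tab shows the value arriving via a separate request (say, a JSON service call), that resource could be fetched directly. The endpoint below is entirely made up and shown only to illustrate the idea; the real one, if it exists, would have to be found in the Network tab:

library(RCurl)
library(RJSONIO)
# Hypothetical data endpoint discovered in the Network tab -- not a real URL
json <- getURL("http://collegecost.ed.gov/scorecard/HypotheticalDataService?id=198969")
dat  <- fromJSON(json)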

1 Answer


You should be able to scrape the page using something like the following RSelenium code. You need to have Java installed and available on your path for the startServer() line to work (and thus for any of this to run).

library("RSelenium")
checkForServer()
startServer()
remDr <- remoteDriver(remoteServerAddr = "localhost", 
                      port = 4444, 
                      browserName = "firefox"
                      )
url <- "http://collegecost.ed.gov/scorecard/UniversityProfile.aspx?org=s&id=198969"
remDr$open()
remDr$navigate(url)
source <- remDr$getPageSource()[[1]]

Check to make sure it worked according to your test:

> grepl("258.12", source)
[1] TRUE
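From there, the usual XML tooling applies to the string RSelenium returned. A minimal sketch (the XPath query is illustrative; the real node would be located via Inspect Element):

library(XML)
doc <- htmlParse(source, asText = TRUE)
# Illustrative query: grab the text of any node containing the test value
xpathSApply(doc, "//*[contains(text(), '258.12')]", xmlValue)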
Thomas
  • I'm away from R for the week. I'll test it when I get back. Thank you. – Tyler Rinker Aug 04 '14 at 20:58
  • If you use the dev version `devtools::install_github("ropensci/RSelenium")` and have `phantomjs` installed, you can drive `phantomjs` without the need for a Selenium server. Details at `?RSelenium::phantom`. An added advantage is that `phantomjs` is headless. – jdharrison Aug 04 '14 at 23:22
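For completeness, a rough sketch of the phantomjs route jdharrison describes, based on the documented `RSelenium::phantom()` workflow of the time (exact arguments may differ across RSelenium versions, and phantomjs must be on the path):

library(RSelenium)
pJS <- phantom()  # start phantomjs in webdriver mode
Sys.sleep(5)      # give the binary a moment to come up
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate("http://collegecost.ed.gov/scorecard/UniversityProfile.aspx?org=s&id=198969")
source <- remDr$getPageSource()[[1]]
grepl("258.12", source)
remDr$close()
pJS$stop()  # shut down the phantomjs process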