This question is based on another that I saw closed, which sparked my curiosity because I learned something new from it: using Google Chrome's Inspect Element to build the HTML parsing path for XML::getNodeSet. That question was closed, I think because it was too broad, so I'll ask a smaller, more focused one that may get at the root of the problem.
I tried to help the poster by writing the code I typically use for scraping, but I immediately ran into a wall: the poster wanted elements that appear in Google Chrome's Inspect Element, and that is not the same as the HTML returned by htmlTreeParse, as demonstrated here:
url <- "http://collegecost.ed.gov/scorecard/UniversityProfile.aspx?org=s&id=198969"
doc <- htmlTreeParse(url, useInternalNodes = TRUE)
m <- capture.output(doc)
any(grepl("258.12", m))
## FALSE
But in Google Chrome's Inspect Element we can see that this information is provided (highlighted in yellow):
How can we get the information shown in Google Chrome's Inspect Element into R? The poster could obviously copy and paste the source into a text editor and parse it that way, but they want to scrape, so that workflow does not scale. Once the poster can get this info into R, they can apply typical HTML parsing techniques (XML- and RCurl-fu).
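
For what it's worth, here is a minimal sketch of one direction that might work, assuming RSelenium and a working Selenium/Chrome setup (the port and browser choice are illustrative defaults, and I have not tested this against the site): drive a real browser so the page's JavaScript actually runs, then hand the rendered source to XML for the usual parsing.

library(RSelenium)  # assumption on my part; not something the original poster mentioned
library(XML)

## Spin up a Selenium-driven Chrome session (port/browser are illustrative)
driver <- rsDriver(browser = "chrome", port = 4444L)
remDr <- driver$client

remDr$navigate("http://collegecost.ed.gov/scorecard/UniversityProfile.aspx?org=s&id=198969")

## Grab the DOM *after* JavaScript has run -- this is what Inspect Element shows
rendered <- remDr$getPageSource()[[1]]
doc <- htmlParse(rendered, asText = TRUE)

any(grepl("258.12", capture.output(doc)))
## hopefully TRUE now that we have the rendered DOM

remDr$close()
driver$server$stop()

Alternatively, if the value arrives via an XHR request visible in the DevTools Network tab, hitting that endpoint directly with RCurl would avoid driving a browser at all.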