I suggest using the XML package and XPath. This requires some learning, but if you're serious about web scraping, it's the way to go. I did this with some county-level elections data from the NY Times website ages ago, and the code looked something like this (just to give you an idea):
library(XML)

getCounty <- function(url) {
  # parse the page, keeping the internal representation so XPath queries work
  doc <- htmlTreeParse(url, useInternalNodes = TRUE)
  # text of every table cell whose class is 'county-name'
  nodes <- getNodeSet(doc, "//tr/td[@class='county-name']/text()")
  tmp <- sapply(nodes, xmlValue)
  county <- sapply(tmp, function(x) clean(x, num = FALSE))
  return(county)
}
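clean() is just a small helper of my own for tidying the raw strings, and I haven't shown it here. A minimal stand-in, assuming all you want is to trim whitespace (and keep only digits when num = TRUE), would be something like:

clean <- function(x, num = FALSE) {
  # drop leading/trailing whitespace
  x <- gsub("^[[:space:]]+|[[:space:]]+$", "", x)
  # for numeric fields, keep only digits and the decimal point
  if (num) x <- gsub("[^0-9.]", "", x)
  x
}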
You can learn about XPath here.
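To give a flavor of the syntax, here are a few common shapes, assuming doc is a page parsed as above (the 'content' id is made up for illustration, and whether any of these match depends on the page):

getNodeSet(doc, "//a")                      # every <a> anywhere in the document
getNodeSet(doc, "//a[@href]")               # only <a> nodes that carry an href attribute
getNodeSet(doc, "//div[@id='content']//p")  # every <p> anywhere inside the div with id 'content'
getNodeSet(doc, "//table/tr[1]/td")         # the cells of the first row of each table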
Another example: grab all R package names from the Crantastic timeline. This looks for a div node with the id "timeline", then for the ul with the class "timeline" inside it, takes the first a node from each of its li children, and returns their text:
url <- 'http://crantastic.org/'
doc <- htmlTreeParse(url, useInternalNodes = TRUE)
# text of the first link inside each timeline entry
nodes <- getNodeSet(doc, "//div[@id='timeline']/ul[@class='timeline']/li/a[1]/text()")
tmp <- sapply(nodes, xmlValue)
tmp
 [1] "landis"          "vegan"           "mutossGUI"       "lordif"
 [5] "futile.paradigm" "lme4"            "tm"              "qpcR"
 [9] "igraph"          "aspace"          "ade4"            "MCMCglmm"
[13] "hts"             "emdbook"         "DCGL"            "wq"
[17] "crantastic"      "Psychometrics"   "crantastic"      "gR"
[21] "crantastic"      "Distributions"   "rAverage"        "spikeslab"
[25] "sem"
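As a shorthand, the XML package also has xpathSApply(), which rolls getNodeSet() and sapply() into a single call, so the extraction above could also be written as:

tmp <- xpathSApply(doc, "//div[@id='timeline']/ul[@class='timeline']/li/a[1]/text()", xmlValue)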