I'm scraping the following site: http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States
Let's say I'm interested in scraping the 4th President - I can see from the table that it's "James Madison". Using the Chrome browser, I can quickly identify the XPath (Inspect Element, Copy XPath). That gives me: "//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a". However, that does not work in R:
library(XML)
url <- "http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
html <- htmlTreeParse(url, useInternalNodes = TRUE)
xpath <- "//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a"
xpathSApply(html, xpath, xmlValue)
This returns NULL. The correct XPath to use here is "//*[@id='mw-content-text']/table[1]/tr[7]/td[2]/b/a" (i.e. without the tbody step). So my questions are:
- How can I change settings in R so that R sees the same XPath as my Chrome browser? I believe it has something to do with the HTTP user agent? This post asked a similar question, but the answer didn't provide enough detail.
- If this is not possible, how can I use the XML package to quickly identify the correct XPath that leads to "James Madison", i.e. "//*[@id='mw-content-text']/table[1]/tr[7]/td[2]/b/a"?
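For reference, here is a minimal sketch of the string-level workaround I have in mind, assuming the only difference between the two paths is the tbody step that Chrome's parser inserts into the DOM (Chrome adds a tbody element to every table when building the DOM, even when the served HTML has none):

```r
# XPath copied from Chrome; its DOM contains <tbody> elements that the
# raw HTML served by Wikipedia (and parsed by R) does not have.
chrome_xpath <- "//*[@id='mw-content-text']/table[1]/tbody/tr[7]/td[2]/b/a"

# Drop the "/tbody" segment to get the path that matches R's parse tree.
fixed_xpath <- gsub("/tbody", "", chrome_xpath, fixed = TRUE)
fixed_xpath
# "//*[@id='mw-content-text']/table[1]/tr[7]/td[2]/b/a"

# fixed_xpath can then be used as before:
# xpathSApply(html, fixed_xpath, xmlValue)
```

This mechanically recovers the working XPath, but I'd like to understand the underlying cause rather than patch the string.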
Thanks!