I'm attempting to learn web scraping using rvest and am trying to reproduce the example given here:
https://www.r-bloggers.com/using-rvest-to-scrape-an-html-table/
Having installed rvest, I simply copy-pasted the code given in the article:
library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>%
  html_table()
population <- population[[1]]
The only difference is that I use read_html() rather than html(), since the latter is deprecated.
Rather than the output reported in the article, this code yields the familiar:
Error in population[[1]] : subscript out of bounds
The error arises because running the code without the final two lines gives population a value of {xml_nodeset (0)}, i.e. the XPath matches no nodes at all.
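To check that I understand what an empty node set means here, I put together a minimal offline sketch. The HTML snippet is entirely made up for illustration (the extra wrapper div and the table contents are assumptions, not Wikipedia's actual markup, which I have not verified); it only shows that a direct-child XPath returns nothing when the table sits one level deeper, while a descendant XPath still finds it:

```r
library(rvest)

# Hypothetical stand-in for the page: the table is wrapped in an extra <div>,
# so it is NOT a direct child of #mw-content-text. This layout is an assumption.
html <- paste0(
  '<div id="mw-content-text"><div class="wrapper">',
  '<table><tr><th>State</th><th>Population</th></tr>',
  '<tr><td>A</td><td>1</td></tr></table>',
  '</div></div>'
)
page <- read_html(html)

# Direct-child XPath, same shape as in the article: matches nothing here.
direct <- html_nodes(page, xpath = '//*[@id="mw-content-text"]/table[1]')
length(direct)  # 0

# Descendant XPath: finds the table at any depth below the container.
nested <- html_nodes(page, xpath = '//*[@id="mw-content-text"]//table')
length(nested)  # 1

# Parsing the matched node gives a list with one data frame.
html_table(nested)[[1]]
```

If the live page were structured like this, it would produce the same empty result I am seeing, but I don't know whether that is actually what is happening.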
All of the previous questions about this error suggest that it is caused by the table being dynamically formatted in JavaScript. But this is not the case here (unless Wikipedia has changed its formatting since the R-bloggers article in 2015).
Any insight would be much appreciated since I'm at a loss!