
I'm having trouble following the selected answer to this question. The table I'm trying to scrape is this list of U.S. state populations.

library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
tables <- readHTMLTable(theurl)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))  # row count per table, to find the one I want

This is the error I'm getting:

Error: failed to load external entity "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"

What gives?

(Note - although I'm looking to resolve this error, if you can point me to an easier way of getting population data I'd appreciate it.)

  • Wikipedia allows downloading its entire database for free: https://en.wikipedia.org/wiki/Wikipedia:Database_download. That should put less strain on already maxed-out webservers. – ScottMcGready Sep 02 '15 at 00:34
  • Err, you could follow the reference link for the data in question, found at the bottom of the page, go to [the reference site](http://www.census.gov/popest/data/state/totals/2013/index.html) (i.e., the Census Bureau), and download the CSV or XLS contained therein. – Shawn Mehan Sep 02 '15 at 00:37
  • @ScottMcGready, you must have a big external HD. :) That's not a small download you're proposing, just for a table of 50 rows with a couple of columns of interest. – Shawn Mehan Sep 02 '15 at 01:12
  • @ShawnMehan maybe ... – ScottMcGready Sep 02 '15 at 01:13
  • This is probably a good (and small) data URL: https://www.census.gov/popest/data/state/totals/2014/tables/NST-EST2014-01.csv (ref: https://www.census.gov/popest/data/state/totals/2014/) – hrbrmstr Sep 02 '15 at 01:57
  • Also, the simple English Wikipedia is usually easier to scrape in my experience: https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population – hrbrmstr Sep 02 '15 at 02:00
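
Picking up hrbrmstr's CSV suggestion above, a minimal sketch of skipping the scrape entirely and reading the Census file directly (assuming that URL is still live; the skip count is a guess to be adjusted after inspecting the raw file):

# Read the Census population-estimates CSV linked in the comments.
# NOTE: skip = 3 is an assumption -- Census CSVs usually carry a few
# preamble rows; open the raw file and adjust.
csv_url <- "https://www.census.gov/popest/data/state/totals/2014/tables/NST-EST2014-01.csv"
pop <- read.csv(csv_url, skip = 3, header = TRUE, stringsAsFactors = FALSE)
head(pop)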

2 Answers


There is nothing wrong with your code. There is, however, something wrong with your URL.

You can test this from a shell, verifying whether the external input to your code (the URL) is what's causing it to fail, e.g.,

curl http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population

which returns an empty body (Wikipedia answers the plain-HTTP request with a 301 redirect), just like your R code sees. This should lead you to believe that it isn't your R code that is faulty. Upon making this discovery, you might retry with the https scheme, again using your free and easy test environment in curl, and run

curl https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population

which will most definitely not return an empty result:

...
<body class="mediawiki ltr sitedir-ltr ns-0 ns-subject page-List_of_U_S_states_and_territories_by_population skin-vector action-view">
    <div id="mw-page-base" class="noprint"></div>
    <div id="mw-head-base" class="noprint"></div>
    <div id="content" class="mw-body" role="main">
– Shawn Mehan

This is pretty easy to do in rvest:

library(rvest)
library(magrittr)  # for %>%, extract(), and extract2()

theurl <- "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"

theurl %>%
  read_html() %>%                            # html() is deprecated in newer rvest
  html_nodes("table") %>% extract(1) %>%     # keep only the first table node
  html_table(fill = TRUE) %>% extract2(1) -> pop_table  # extract2() unwraps the one-element list to a data frame
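
A quick sanity check on the result (a sketch; the real column names depend on Wikipedia's current markup, so the column index below is a placeholder):

str(pop_table)   # should be a data.frame, not a list
head(pop_table)

# population figures come through as comma-formatted strings;
# generic cleanup (column 3 is a placeholder -- check names(pop_table) first)
as.numeric(gsub(",", "", pop_table[[3]]))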

See @Cory's blog for more info.

– C8H10N4O2