
I regularly extract tables from Wikipedia. Excel's web import does not work properly for Wikipedia, because it treats the whole page as one table. In a Google spreadsheet, I can enter this:

=ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan","table",3)

and this function will download the third table from that page, which lists all the counties of the Upper Peninsula of Michigan.

Is there something similar in R, or can it be created via a user-defined function?

karlos
    Possible Duplicate http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package – Ramnath Sep 13 '11 at 20:51
    @DWin simple, yes; but repetitive/reproducible? no. isn't one script to do all nice? – karlos Sep 13 '11 at 20:55
  • @Ramnath I had not seen that thread, but the solution provided in that thread does work: readHTMLTable(theurl) and tables[3]. thanks for sharing that. will have to figure out how to convert the result to a proper frame – karlos Sep 13 '11 at 21:02
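The conversion the comment mentions can be sketched as follows. This is a hedged example, assuming the extracted table comes back (as `readHTMLTable` often returns it for Wikipedia tables) as all-character columns with the real header sitting in row 1; the sample values are taken from the county table shown in the answers below:

```r
# Sketch: suppose `tbl` is the all-character data frame that
# readHTMLTable() returned, with the real header stuck in row 1.
tbl <- data.frame(
  V1 = c("County", "Alger", "Baraga"),
  V2 = c("Population", "9,862", "8,735"),
  stringsAsFactors = FALSE
)

names(tbl) <- as.character(tbl[1, ])  # promote row 1 to column names
tbl <- tbl[-1, ]                      # drop the header row
row.names(tbl) <- NULL                # renumber the rows from 1

# strip thousands separators before converting to numeric
tbl$Population <- as.numeric(gsub(",", "", tbl$Population))
```

After this, `tbl` is a proper data frame with named columns and a numeric `Population` column.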

6 Answers


Building on Andrie's answer, and addressing SSL: if you can take one additional library dependency, httr can fetch the page over https and hand the text to readHTMLTable:

library(httr)
library(XML)

url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"

# fetch the page over https, then parse the tables from the response text
r <- GET(url)

doc <- readHTMLTable(
  doc = content(r, "text"))

doc[[6]]
schnee

The function readHTMLTable in package XML is ideal for this.

Try the following:

library(XML)
doc <- readHTMLTable(
         doc="http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")

doc[[6]]

            V1         V2                 V3                              V4
1       County Population Land Area (sq mi) Population Density (per sq mi)
2        Alger      9,862                918                            10.7
3       Baraga      8,735                904                             9.7
4     Chippewa     38,413               1561                            24.7
5        Delta     38,520               1170                            32.9
6    Dickinson     27,427                766                            35.8
7      Gogebic     17,370               1102                            15.8
8     Houghton     36,016               1012                            35.6
9         Iron     13,138               1166                            11.3
10    Keweenaw      2,301                541                             4.3
11        Luce      7,024                903                             7.8
12    Mackinac     11,943               1022                            11.7
13   Marquette     64,634               1821                            35.5
14   Menominee     25,109               1043                            24.3
15   Ontonagon      7,818               1312                             6.0
16 Schoolcraft      8,903               1178                             7.6
17       TOTAL    317,258             16,420                            19.3

readHTMLTable returns a list with one data.frame for each table on the HTML page. You can use names to see which element corresponds to which table:

> names(doc)
 [1] "NULL"                                                                               
 [2] "toc"                                                                                
 [3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
 [4] "NULL"                                                                               
 [5] "Cities and Villages of the Upper Peninsula"                                         
 [6] "Upper Peninsula Land Area and Population Density by County"                         
 [7] "19th Century Population by Census Year of the Upper Peninsula by County"            
 [8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"   
 [9] "NULL"                                                                               
[10] "NULL"                                                                               
[11] "NULL"                                                                               
[12] "NULL"                                                                               
[13] "NULL"                                                                               
[14] "NULL"                                                                               
[15] "NULL"                                                                               
[16] "NULL" 
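Because each list element is named after its table caption, you can also index by name rather than by position, which keeps working if tables are added or reordered on the page. A minimal sketch, using a stand-in list shaped like `readHTMLTable`'s result (the caption string is taken from the `names()` output above):

```r
# Stand-in for readHTMLTable()'s result; in practice `doc` would come
# from readHTMLTable(doc = "http://en.wikipedia.org/wiki/...").
doc <- list(
  "NULL" = data.frame(),
  "Upper Peninsula Land Area and Population Density by County" =
    data.frame(County     = c("Alger", "Baraga"),
               Population = c("9,862", "8,735"),
               stringsAsFactors = FALSE)
)

# Index by caption instead of position
pop <- doc[["Upper Peninsula Land Area and Population Density by County"]]
```

`pop` is then the same data frame that `doc[[6]]` returns above, without hard-coding the position 6.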
Andrie
    I tried the code `readHTMLTable(doc = "https://en.wikipedia.org/wiki/Gross_domestic_product")` and got `XML content does not seem to be XML:` I'm guessing that the `https` can be the problem, how to work around it? – Konrad Jul 06 '15 at 22:27
    This solution no longer works after Wikipedia moved to secured connection. Any clue how to get it to work? – Shambho Jan 15 '16 at 01:21
    See `schnee`'s answer to this question which addresses https – Paul James Apr 23 '19 at 02:03

Here is a solution that works with the secure (https) link:

install.packages("htmltab")
library(htmltab)
htmltab("https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan", which = 3)
Shambho

One simple way to do it is to use the RGoogleDocs interface to let Google Docs do the conversion for you:

http://www.omegahat.org/RGoogleDocs/run.html

You can then use the =ImportHtml Google Docs function with all its pre-built magic.

Ari B. Friedman

A tidyverse solution using rvest. It's very useful if you need to find a table based on some keywords, for example in the table headers. Here is an example where we want to get the table on vital statistics of Egypt. Note: html_nodes(x = page, css = "table") is a useful way to browse the tables available on a page.

library(magrittr)
library(rvest)
library(stringr)

# read the page
read_html("https://en.wikipedia.org/wiki/Demographics_of_Egypt") %>% 
    # list all tables on the page
    html_nodes(css = "table") %>% 
    # select the one containing the needed keywords
    extract2(str_which(string = ., pattern = "Live births")) %>% 
    # convert to a data frame
    html_table(fill = TRUE) %>%  
    View()
ikashnitsky

That table is the only table which is a child of the second td child of its parent, so you can specify that pattern with CSS. Rather than using a type selector of table to grab the child table, you can use its class, which is faster:

library(rvest)

t <- read_html('https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan') %>% 
  html_node('td:nth-child(2) .wikitable') %>% 
  html_table()

print(t)
QHarr