
I regularly extract tables from Wikipedia. Excel's web import does not work properly for Wikipedia, because it treats the whole page as one table. In a Google spreadsheet, I can enter this:

=ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan","table",3)

and this function will download the third table from that page, which lists all the counties of the Upper Peninsula of Michigan.

Is there something similar in R, or can it be created via a user-defined function?

karlos
    Possible Duplicate http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package – Ramnath Sep 13 '11 at 20:51
    @DWin simple, yes; but repetitive/reproducible? no. isn't one script to do all nice? – karlos Sep 13 '11 at 20:55
  • @Ramnath I had not seen that thread, but the solution provided in that thread does work: readHTMLTable(theurl) and tables[3]. thanks for sharing that. will have to figure out how to convert the result to a proper frame – karlos Sep 13 '11 at 21:02
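The conversion the comment mentions can be sketched as follows. This is a hedged example, assuming the extracted table comes back (as `readHTMLTable` often returns it for Wikipedia tables) as all-character columns with the real header sitting in row 1; the sample values are taken from the county table shown in the answers below:

```r
# Sketch: suppose `tbl` is the all-character data frame that
# readHTMLTable() returned, with the real header stuck in row 1.
tbl <- data.frame(
  V1 = c("County", "Alger", "Baraga"),
  V2 = c("Population", "9,862", "8,735"),
  stringsAsFactors = FALSE
)

names(tbl) <- as.character(tbl[1, ])  # promote row 1 to column names
tbl <- tbl[-1, ]                      # drop the header row
row.names(tbl) <- NULL                # renumber the rows from 1

# strip thousands separators before converting to numeric
tbl$Population <- as.numeric(gsub(",", "", tbl$Population))
```

After this, `tbl` is a proper data frame with named columns and a numeric `Population` column.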

6 Answers


Building on Andrie's answer, and addressing SSL: if you can take one additional library dependency, httr can fetch the page over https and hand the text to readHTMLTable:

library(httr)
library(XML)

url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"

# fetch the page over https, then parse the tables from the response text
r <- GET(url)

doc <- readHTMLTable(
  doc = content(r, "text"))

doc[[6]]
schnee

The function readHTMLTable in package XML is ideal for this.

Try the following:

library(XML)
doc <- readHTMLTable(
         doc="http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")

doc[[6]]

            V1         V2                 V3                              V4
1       County Population Land Area (sq mi) Population Density (per sq mi)
2        Alger      9,862                918                            10.7
3       Baraga      8,735                904                             9.7
4     Chippewa     38,413               1561                            24.7
5        Delta     38,520               1170                            32.9
6    Dickinson     27,427                766                            35.8
7      Gogebic     17,370               1102                            15.8
8     Houghton     36,016               1012                            35.6
9         Iron     13,138               1166                            11.3
10    Keweenaw      2,301                541                             4.3
11        Luce      7,024                903                             7.8
12    Mackinac     11,943               1022                            11.7
13   Marquette     64,634               1821                            35.5
14   Menominee     25,109               1043                            24.3
15   Ontonagon      7,818               1312                             6.0
16 Schoolcraft      8,903               1178                             7.6
17       TOTAL    317,258             16,420                            19.3

readHTMLTable returns a list with one data.frame for each table on the HTML page. You can use names to see which element corresponds to which table:

> names(doc)
 [1] "NULL"                                                                               
 [2] "toc"                                                                                
 [3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
 [4] "NULL"                                                                               
 [5] "Cities and Villages of the Upper Peninsula"                                         
 [6] "Upper Peninsula Land Area and Population Density by County"                         
 [7] "19th Century Population by Census Year of the Upper Peninsula by County"            
 [8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"   
 [9] "NULL"                                                                               
[10] "NULL"                                                                               
[11] "NULL"                                                                               
[12] "NULL"                                                                               
[13] "NULL"                                                                               
[14] "NULL"                                                                               
[15] "NULL"                                                                               
[16] "NULL" 
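Because each list element is named after its table caption, you can also index by name rather than by position, which keeps working if tables are added or reordered on the page. A minimal sketch, using a stand-in list shaped like `readHTMLTable`'s result (the caption string is taken from the `names()` output above):

```r
# Stand-in for readHTMLTable()'s result; in practice `doc` would come
# from readHTMLTable(doc = "http://en.wikipedia.org/wiki/...").
doc <- list(
  "NULL" = data.frame(),
  "Upper Peninsula Land Area and Population Density by County" =
    data.frame(County     = c("Alger", "Baraga"),
               Population = c("9,862", "8,735"),
               stringsAsFactors = FALSE)
)

# Index by caption instead of position
pop <- doc[["Upper Peninsula Land Area and Population Density by County"]]
```

`pop` is then the same data frame that `doc[[6]]` returns above, without hard-coding the position 6.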
Andrie
    I tried the code `readHTMLTable(doc = "https://en.wikipedia.org/wiki/Gross_domestic_product")` and got `XML content does not seem to be XML:` I'm guessing that the `https` can be the problem, how to work around it? – Konrad Jul 06 '15 at 22:27
    This solution no longer works after Wikipedia moved to secured connection. Any clue how to get it to work? – Shambho Jan 15 '16 at 01:21
    See `schnee`'s answer to this question which addresses https – Paul James Apr 23 '19 at 02:03

Here is a solution that works with the secure (https) link:

install.packages("htmltab")
library(htmltab)
htmltab("https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan", which = 3)
Shambho

One simple way to do it is to use the RGoogleDocs interface to let Google Docs do the conversion for you:

http://www.omegahat.org/RGoogleDocs/run.html

You can then use the =ImportHtml Google Docs function with all its pre-built magic.

Ari B. Friedman

A tidyverse solution using rvest. It's very useful if you need to find a table based on some keywords, for example in the table headers. Here is an example where we want to get the table on vital statistics of Egypt. Note: html_nodes(x = page, css = "table") is a useful way to browse the tables available on a page.

library(magrittr)
library(rvest)
library(stringr)

# read the page
read_html("https://en.wikipedia.org/wiki/Demographics_of_Egypt") %>% 
    # list all tables on the page
    html_nodes(css = "table") %>% 
    # select the one containing the needed keywords
    extract2(str_which(string = ., pattern = "Live births")) %>% 
    # convert to a data frame
    html_table(fill = TRUE) %>%  
    View()
ikashnitsky

That table is the only table which is a child of the second td child of its parent, so you can specify that pattern with CSS. Rather than using a type selector of table to grab the child table, you can use its class, which is faster:

library(rvest)

t <- read_html('https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan') %>% 
  html_node('td:nth-child(2) .wikitable') %>% 
  html_table()

print(t)
QHarr