Scraping Wikipedia HTML table with images, text, and blank cells with R

Question

I want to scrape the Wikipedia table of Michelin-starred restaurants in NYC, in which the number of stars awarded is shown by images rather than text.

Using the rvest package, I was able to scrape the table in two steps (first get the text in the "Name" and "Borough" columns, then get the alt attributes of the images in the table body), but I want to know if it can be done in one step.

Since Wikipedia pages can't be read by the XML::readHTMLTable function, I tried the htmltab package with no luck, because I couldn't figure out the function needed for the bodyFun argument. Truth be told, I am a newbie to web scraping... and functions.

Questions I consulted:

Scraping html table with images using XML R package

Scraping html tables into R data frames using the XML package

Here is my code:

library(stringr)
library(rvest)
library(data.table)

url <- "http://en.wikipedia.org/wiki/List_of_Michelin_starred_restaurants_in_New_York_City"

#Scrape the first two columns, restaurant name and borough
name.boro <- url %>% read_html() %>% html_nodes("table") %>% html_table(fill = TRUE)
name.boro <- as.data.table(name.boro[[1]])
name.boro[, 3:length(name.boro) := NULL]
135 * 13 #1,755 cells in first table

#scrape tables for img alt 
#note that because I used the "td" node, entries for all cells in all tables were pulled
stars <- url %>% read_html() %>% html_nodes("td") %>% html_node("img") %>% html_attr("alt")
stars 

#Make vector of numbers to index each column
df <- vector("list", 13)
for (i in 1:13){
  df[[i]] <- seq(i, 1755, 13)
}

#Put everything together
Mich.Guide <- name.boro 
Mich.Guide[, c("X2006", "X2007", "X2008", "X2009", "X2010", "X2011", "X2012", "X2013", "X2014", "X2015", 
               "X2016") := .(stars[unlist(df[3])], stars[unlist(df[4])], stars[unlist(df[5])], 
                             stars[unlist(df[6])], stars[unlist(df[7])], stars[unlist(df[8])], 
                             stars[unlist(df[9])], stars[unlist(df[10])], stars[unlist(df[11])], 
                             stars[unlist(df[12])], stars[unlist(df[13])] )]

Thank you!

_"Since wikipedia pages can't be read by the XML package…"_ => pls explain this mis-truth. — hrbrmstr, Aug 06 '16 at 20:47
the following threw an error: `url <- "http://en.wikipedia.org/wiki/List_of_Michelin_starred_restaurants_in_New_York_City"` `readHTMLTable(url, which=1)` ... so I looked it up here [http://stackoverflow.com/questions/7407735/importing-wikipedia-tables-in-r](http://stackoverflow.com/questions/7407735/importing-wikipedia-tables-in-r) and user Shambho commented that the secure connection doesn't work in the package. Are you able to use the readHTMLTable command on that site? — MiamiCG, Aug 06 '16 at 23:12
Saying `readHTMLTable()` doesn't work == "cannot use the XML package" is a bit disingenuous — hrbrmstr, Aug 06 '16 at 23:26
That function works fine on other sites, so I can't say that it "doesn't work." The package documentation for `readHTMLTable` uses Wikipedia as an example, with a note that the secure connection is unsupported as of last year. Either way, I made the language clearer in an edit. — MiamiCG, Aug 06 '16 at 23:47
The `XML` solution has been added to my answer only using base R apart from `RCurl` & `stringi`. — hrbrmstr, Aug 06 '16 at 23:49
OK, thanks! It wasn't obvious to me why one function of a package would be able to read a URL and another wouldn't, but I looked at the table one and the parse one you used and see that the arguments differ. — MiamiCG, Aug 07 '16 at 00:01

score 3 · Accepted Answer · answered Aug 06 '16 at 20:05

You can try the following:

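library(rvest)
url <- "http://en.wikipedia.org/wiki/List_of_Michelin_starred_restaurants_in_New_York_City"
doc <- read_html(url)
# Header cells of the first row give the column names
col_names <- doc %>% html_nodes("#mw-content-text > table > tr:nth-child(1) > th") %>% html_text()
# Every row after the header row is one restaurant
tbody <- doc %>% html_nodes("#mw-content-text > table > tr:not(:first-child)")

# For one row: text of the first two cells (name, borough), then the alt
# attribute of the image in each remaining cell (NA if the cell has no image)
extract_tr <- function(tr){
  scope <- tr %>% html_children()
  c(scope[1:2] %>% html_text(),
    scope[3:length(scope)] %>% html_node("img") %>% html_attr("alt"))
}

# sapply() returns one column per table row, so transpose before converting
res <- tbody %>% sapply(extract_tr)
res <- as.data.frame(t(res), stringsAsFactors = FALSE)
colnames(res) <- col_names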

Now you have the raw table. I leave converting the star columns to integers and tidying the column names to you.
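
As a rough sketch of that remaining conversion step, assuming the star count appears as the first digit in each alt string: the toy data frame, its alt values, and the helper name `to_stars` below are all made up for illustration, since the actual alt text on the page is not shown here.

```r
# Toy stand-in for `res`; the alt strings are invented, not taken from the page.
res <- data.frame(
  Name    = c("A", "B"),
  Borough = c("Manhattan", "Brooklyn"),
  `2006`  = c("1 star", NA),
  `2007`  = c("Michelin-3.svg", ""),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: first digit found in each string, as an integer.
to_stars <- function(x) {
  digit <- sub("^[^0-9]*([0-9]).*$", "\\1", x)
  digit[!grepl("[0-9]", x)] <- NA   # NA, "" and digit-free text all become NA
  as.integer(digit)
}

# Convert every column after the first two (name and borough).
res[-(1:2)] <- lapply(res[-(1:2)], to_stars)
str(res)
```

If the real alt text encodes the stars differently (for example as words), only the pattern inside `to_stars` would need to change.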

Thank you - I think this solution is more universal. I hope one day a package would have a shorter solution wrapped up in a function! — MiamiCG, Aug 08 '16 at 22:12

hrbrmstr · Answer 2 · 2016-08-06T23:49:22.303

Slightly different approach:

library(rvest)
library(purrr)
library(stringi)

pg <- read_html("http://en.wikipedia.org/wiki/List_of_Michelin_starred_restaurants_in_New_York_City")
html_nodes(pg, xpath=".//table[contains(@class, 'wikitable')]/tr[not(th)]") %>% 
  map_df(function(x) {
    r_name <- html_text(html_nodes(x, xpath=".//td[1]"))
    borough <- html_text(html_nodes(x, xpath=".//td[2]"))
    map(3:13, function(y) {
      stars <- html_attr(html_nodes(x, xpath=sprintf(".//td[%d]/a", y)), "href")
      if (length(stars)==0) {
        NA
      } else {
        stri_match_first_regex(stars, "Michelin-([[:digit:]])")[,2] 
      }
    }) -> refs
    refs <- setNames(refs, c(2006:2016))
    as.data.frame(c(r_name=r_name, borough=borough, refs), stringsAsFactors=FALSE)
  }) -> michelin_nyc

dplyr::glimpse(michelin_nyc)

## Observations: 135
## Variables: 13
## $ r_name  <chr> "Adour", "Ai Fiori", "Alain Ducasse at the...
## $ borough <chr> "Manhattan", "Manhattan", "Manhattan", "Ma...
## $ X2006   <chr> NA, NA, "3", NA, NA, NA, NA, "1", NA, NA, ...
## $ X2007   <chr> NA, NA, NA, NA, NA, NA, NA, "1", NA, NA, N...
## $ X2008   <chr> NA, NA, NA, NA, NA, NA, NA, "1", "1", NA, ...
## $ X2009   <chr> "2", NA, NA, NA, "1", "1", NA, "1", "1", N...
## $ X2010   <chr> "1", NA, NA, NA, NA, "2", NA, "1", "1", NA...
## $ X2011   <chr> "1", NA, NA, "1", NA, "2", NA, "1", "1", N...
## $ X2012   <chr> "1", "1", NA, "1", NA, NA, NA, "1", NA, NA...
## $ X2013   <chr> "1", "1", NA, "1", NA, NA, NA, "1", NA, "1...
## $ X2014   <chr> NA, "1", NA, "1", NA, NA, NA, "1", NA, "1"...
## $ X2015   <chr> NA, "1", NA, "1", NA, NA, "1", NA, NA, "2"...
## $ X2016   <chr> NA, "1", NA, "1", NA, NA, "1", NA, NA, "2"...

This is also totally doable with the XML package, as you can see below:

library(XML)
library(RCurl)
library(stringi)

pg <- getURL("https://en.wikipedia.org/wiki/List_of_Michelin_starred_restaurants_in_New_York_City")
pg <- htmlParse(pg)
rows <- getNodeSet(pg, "//table[contains(@class, 'wikitable')]/tr[not(th)]")
do.call(rbind, lapply(rows, function(x) {
  r_name <- xpathSApply(x, ".//td[1]", xmlValue)
  borough <- xpathSApply(x, ".//td[2]", xmlValue)
  lapply(3:13, function(y) {
    stars <- xpathSApply(x, sprintf(".//td[%d]/a", y), xmlGetAttr, "href")
    if (length(stars)==0) {
      NA
    } else {
      stri_match_first_regex(stars, "Michelin-([[:digit:]])")[,2] 
    }
  }) -> refs
  refs <- setNames(refs, c(2006:2016))
  as.data.frame(c(r_name=r_name, borough=borough, refs), stringsAsFactors=FALSE)
})) -> michelin_nyc

str(michelin_nyc)

## 'data.frame': 135 obs. of  13 variables:
##  $ r_name : chr  "Adour" "Ai Fiori" "Alain Ducasse at the Essex House" "Aldea" ...
##  $ borough: chr  "Manhattan" "Manhattan" "Manhattan" "Manhattan" ...
##  $ X2006  : chr  NA NA "3" NA ...
##  $ X2007  : chr  NA NA NA NA ...
##  $ X2008  : chr  NA NA NA NA ...
##  $ X2009  : chr  "2" NA NA NA ...
##  $ X2010  : chr  "1" NA NA NA ...
##  $ X2011  : chr  "1" NA NA "1" ...
##  $ X2012  : chr  "1" "1" NA "1" ...
##  $ X2013  : chr  "1" "1" NA "1" ...
##  $ X2014  : chr  NA "1" NA "1" ...
##  $ X2015  : chr  NA "1" NA "1" ...
##  $ X2016  : chr  NA "1" NA "1" ...
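
To see what the regular-expression step in both versions does, here is a small sketch on made-up `href` strings (the real link targets on the page may look different). `stri_match_first_regex()` returns a character matrix whose first column is the whole match and whose second column is the first capture group, which is why the code takes `[,2]`. The final `as.integer()` is an optional extra, in case numeric star counts are preferred over the character columns shown above.

```r
library(stringi)

# Hypothetical href values, for illustration only.
hrefs <- c("/wiki/File:Michelin-1.svg",
           "/wiki/File:Michelin-3.svg",
           "/wiki/Something_else")

# Column 1: full match ("Michelin-1"); column 2: the captured digit ("1").
# Strings that do not match yield NA in every column.
m <- stri_match_first_regex(hrefs, "Michelin-([[:digit:]])")

# Optional: turn the captured digits into integers.
stars <- as.integer(m[, 2])
print(stars)
```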

Thank you for doing it in rvest and XML! The output is very clean. — MiamiCG, Aug 08 '16 at 22:13
