The table I am interested in is the Wikipedia table of Michelin-starred restaurants in NYC, where the number of stars awarded is shown as images rather than text.
I was able to scrape the table with the rvest package in two steps (first get the text in the "Name" and "Borough" columns, then get the alt attributes of the images in the table body), but I would like to know whether it can be done in one step.
Since Wikipedia pages can't be read by the XML::readHTMLTable function, I tried the htmltab package, but had no luck because I couldn't figure out what function to pass to the bodyFun argument. Truth be told, I am a newbie to web scraping...and to functions.
Questions I referred to for reference:
Scraping html table with images using XML R package
Scraping html tables into R data frames using the XML package
Here is my code:
library(stringr)
library(rvest)
library(data.table)
url <- "http://en.wikipedia.org/wiki/List_of_Michelin_starred_restaurants_in_New_York_City"
# Scrape the first two columns: restaurant name and borough
name.boro <- url %>% read_html() %>% html_nodes("table") %>% html_table(fill = TRUE)
name.boro <- as.data.table(name.boro[[1]])
name.boro[, 3:length(name.boro) := NULL]
135 * 13 # 135 rows x 13 columns = 1,755 cells in the first table
# Scrape the tables for the img alt attributes
# Note that because I used the "td" node, entries for all cells in all tables were pulled
stars <- url %>% read_html() %>% html_nodes("td") %>% html_node("img") %>% html_attr("alt")
stars
# Make a list of index vectors, one per column
df <- vector("list", 13)
for (i in 1:13) {
  df[[i]] <- seq(i, 1755, 13)
}
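(As an aside, since the stars vector is filled row by row across 13 columns, the same column split can be sketched without the index loop by reshaping into a matrix. This assumes the cell count is an exact multiple of 13, as it is here:)

```r
# Alternative sketch: reshape the flat cell vector into a 13-column matrix,
# filling by row, so each matrix column corresponds to one table column.
# Assumes length(stars) is a multiple of 13 (1,755 cells here).
stars.mat <- matrix(stars, ncol = 13, byrow = TRUE)
# stars.mat[, 3] then holds the same entries as stars[unlist(df[3])]
```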
# Put everything together
Mich.Guide <- name.boro
Mich.Guide[, c("X2006", "X2007", "X2008", "X2009", "X2010", "X2011", "X2012", "X2013", "X2014", "X2015",
"X2016") := .(stars[unlist(df[3])], stars[unlist(df[4])], stars[unlist(df[5])],
stars[unlist(df[6])], stars[unlist(df[7])], stars[unlist(df[8])],
stars[unlist(df[9])], stars[unlist(df[10])], stars[unlist(df[11])],
stars[unlist(df[12])], stars[unlist(df[13])] )]
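(For what it's worth, the eleven assignments above could also be written more compactly; this is a sketch of the same logic, not a different scraping approach:)

```r
# Compact sketch of the same assignment: build each year column by taking
# every 13th element of the flat `stars` vector, starting at positions 3-13.
year.cols <- paste0("X", 2006:2016)
Mich.Guide[, (year.cols) := lapply(3:13, function(i) stars[seq(i, 1755, 13)])]
```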
Thank you!