
I'm trying to programmatically pull all of the box scores for a given day from Basketball Reference (I used January 4th, 2020, which has multiple games). I started by creating a vector of integers to denote the number of box scores to pull:

games <- c(1:3)

Then I used my browser's developer tools to determine the selector for each table (you can also use SelectorGadget):

#content > div.game_summaries > div:nth-child(1) > table.teams

Then I used purrr::map to create a list of the selectors to pull, using games:

map_list <- map(.x = '', paste, '#content > div.game_summaries > div:nth-child(', games, ') > table.teams',
                sep = "")
# check map_list
map_list

Then I tried to run this list through a for loop to generate three tables, using tidyverse and rvest, which delivered an error:

for (i in map_list){
  read_html('https://www.basketball-reference.com/boxscores/') %>% 
    html_node(map_list[[1]][i]) %>% 
    html_table() %>% 
    glimpse()
}

Error in selectr::css_to_xpath(css, prefix = ".//") : 
  Zero length character vector found for the following argument: selector
In addition: Warning message:
In selectr::css_to_xpath(css, prefix = ".//") :
  NA values were found in the 'selector' argument, they have been removed
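For context on why this fails: for (i in map_list) iterates over the elements of map_list, so i is the entire character vector of selectors, and map_list[[1]][i] then indexes that vector by character. Because the vector is unnamed, that lookup returns NA, which is exactly what selectr is rejecting. A minimal sketch of the indexing behaviour (selectors here is a hypothetical stand-in for map_list[[1]]):

```r
# Indexing an unnamed character vector by a string matches against
# names; with no names present, the result is NA
selectors <- c("sel1", "sel2", "sel3")
is.na(selectors["sel1"])   # TRUE
```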

For reference, if I explicitly supply the selector or call the exact item from map_list, the code works as intended (run the items below for reference):

read_html('https://www.basketball-reference.com/boxscores/') %>% 
  html_node('#content > div.game_summaries > div:nth-child(1) > table.teams') %>% 
  html_table() %>% 
  glimpse()

read_html('https://www.basketball-reference.com/boxscores/') %>% 
  html_node(map_list[[1]][1]) %>% 
  html_table() %>% 
  glimpse()

How do I make this work with a list? I have looked at other threads but even though they use the same site, they're not the same issue.

dre
2 Answers


Using your current map_list, if you want to use a for loop, this is what you should use:

library(rvest)

for (i in seq_along(map_list[[1]])){
  read_html('https://www.basketball-reference.com/boxscores/') %>% 
    html_node(map_list[[1]][i]) %>% 
    html_table() %>% 
    glimpse()
}

but I think this is simpler, as you don't need map to create map_list at all, since paste0 is vectorized:

map_list<- paste0('#content > div.game_summaries > div:nth-child(', games, ') > table.teams')
url <- 'https://www.basketball-reference.com/boxscores/'
webpage <- url %>% read_html()

purrr::map(map_list, ~webpage %>% html_node(.x) %>% html_table)

#[[1]]
#       X1  X2    X3
#1 Indiana 111 Final
#2 Atlanta 116      

#[[2]]
#        X1  X2    X3
#1  Toronto 121 Final
#2 Brooklyn 102      

#[[3]]
#       X1  X2    X3
#1  Boston 111 Final
#2 Chicago 104      
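If a single data frame is preferred over a list of tables, the same scrape collapses with map_dfr, which row-binds the results (a sketch assuming the map_list and webpage objects above; the "game" id column name is my own choice):

```r
library(purrr)
library(rvest)

# Row-bind the three score tables, recording which selector
# produced each row in an id column
scores <- map_dfr(map_list,
                  ~ webpage %>% html_node(.x) %>% html_table(),
                  .id = "game")
```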
Ronak Shah

This page is reasonably straightforward to scrape. Here is a possible solution: first scrape the game summary nodes (each div with class="game_summary"), which gives one node per game played. This also allows the use of the html_node function, which guarantees a return value per node, thus keeping the list sizes equal.

Each game summary is made up of three subtables; the first and third can be scraped directly by class. The second table has no class assigned, which makes it trickier to retrieve.

library(rvest)

page <- read_html('https://www.basketball-reference.com/boxscores/')

#find all of the game summaries on the page
games<-page %>% html_nodes("div.game_summary")

#Each game summary has 3 sub-tables:
# - the game score is table 1 (class = "teams")
# - the stats are table 3 (class = "stats")
# - the quarterly score is the second table and has no class defined
table1  <- games %>% html_node("table.teams") %>% html_table()
stats   <- games %>% html_node("table.stats") %>% html_table()
quarter <- sapply(games, function(g){
  g %>% html_nodes("table") %>% .[2] %>% html_table()
})
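Because html_node returns exactly one node per game, table1, stats, and quarter all stay aligned by position, so the per-game score tables can be stacked into a single data frame in base R (a sketch; the game index column is my own addition):

```r
# Tag each score table with its game number, then stack them
all_scores <- do.call(rbind, Map(cbind, game = seq_along(table1), table1))
```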
  
Dave2e