This is a solution using the tidyverse to scrape this website. But first we check the robots.txt file of the website to get a sense of the rate limit for requests. See the post Analyzing "Crawl-Delay" Settings in Common Crawl robots.txt Data with R for further details.
library(spiderbar)
library(robotstxt)
rt <- robxp(get_robotstxt("https://www.basketball-reference.com"))
crawl_delays(rt)
#> agent crawl_delay
#> 1 * 3
#> 2 ahrefsbot -1
#> 3 twitterbot -1
#> 4 slysearch -1
#> 5 ground-control -1
#> 6 groundcontrol -1
#> 7 matrix -1
#> 8 hal9000 -1
#> 9 carmine -1
#> 10 the-matrix -1
#> 11 skynet -1
We are interested in the * value: we have to wait a minimum of 3 seconds between requests. We will use 5 seconds to be safe.
We use the tidyverse ecosystem to build the URLs and iterate through them to get a table with all the data.
library(tidyverse)
library(rvest)
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
month_sub <- c("october", "november", "december", "january")
urls <- map_chr(month_sub, ~ paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-", .,".html"))
urls
#> [1] "https://www.basketball-reference.com/leagues/NBA_2018_games-october.html"
#> [2] "https://www.basketball-reference.com/leagues/NBA_2018_games-november.html"
#> [3] "https://www.basketball-reference.com/leagues/NBA_2018_games-december.html"
#> [4] "https://www.basketball-reference.com/leagues/NBA_2018_games-january.html"
pb <- progress_estimated(length(urls))
map(urls, ~ {
  url <- .
  pb$tick()$print()
  Sys.sleep(5) # wait 5 seconds between requests, as decided above
  read_html(url) %>%
    # select the table part by its table id tag
    html_nodes("#schedule") %>%
    # extract the table
    html_table() %>%
    # we get a one-element list, so flatten it into a tibble
    flatten_df()
}) -> tables
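As an aside, instead of calling Sys.sleep() by hand inside the loop, the pause can be built into the function itself with purrr::slowly() (available in purrr >= 0.3.0). This is a minimal sketch with a toy function and a short delay; in the scraping code above you would wrap read_html with rate_delay(5) instead.

```r
library(purrr)

# slowly() returns a rate-limited version of a function: here a toy
# nchar() stand-in with a 0.1 s delay, just to show the mechanics.
count_slowly <- slowly(function(x) nchar(x), rate = rate_delay(0.1))

map_int(c("october", "november"), count_slowly)
#> [1] 7 8
```

For the real scraper: read_slowly <- slowly(read_html, rate = rate_delay(5)), then map over the URLs with read_slowly.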
# we get a list of tables, one per month
str(tables, 1)
#> List of 4
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 104 obs. of 8 variables:
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 213 obs. of 8 variables:
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 227 obs. of 8 variables:
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 216 obs. of 8 variables:
# we can get all the data in one table by binding rows.
# As we saw on the website, there are 2 columns with no names,
# so we take care of them with tibble::repair_names() before row binding
res <- tables %>%
map_df(tibble::repair_names)
res
#> # A tibble: 760 x 8
#> Date `Start (ET)` `Visitor/Neutral` PTS
#> <chr> <chr> <chr> <int>
#> 1 Tue, Oct 17, 2017 8:01 pm Boston Celtics 102
#> 2 Tue, Oct 17, 2017 10:30 pm Houston Rockets 121
#> 3 Wed, Oct 18, 2017 7:30 pm Milwaukee Bucks 100
#> 4 Wed, Oct 18, 2017 8:30 pm Atlanta Hawks 111
#> 5 Wed, Oct 18, 2017 7:00 pm Charlotte Hornets 102
#> 6 Wed, Oct 18, 2017 7:00 pm Brooklyn Nets 140
#> 7 Wed, Oct 18, 2017 8:00 pm New Orleans Pelicans 103
#> 8 Wed, Oct 18, 2017 7:00 pm Miami Heat 116
#> 9 Wed, Oct 18, 2017 10:00 pm Portland Trail Blazers 76
#> 10 Wed, Oct 18, 2017 10:00 pm Houston Rockets 100
#> # ... with 750 more rows, and 4 more variables: `Home/Neutral` <chr>,
#> # V1 <chr>, V2 <chr>, Notes <lgl>
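The two repaired columns keep the placeholder names V1 and V2, which can be renamed to something meaningful with dplyr::rename(). A minimal sketch on a toy one-row tibble standing in for res; the names box_score and overtime are our own labels for those columns, not names taken from the site.

```r
library(dplyr)
library(tibble)

# toy stand-in for `res`: repair_names() labels unnamed columns V1, V2
res <- tibble(
  Date = "Tue, Oct 17, 2017",
  V1   = "Box Score",
  V2   = NA_character_
)

# give the repaired columns descriptive names (our choice of labels)
res_clean <- rename(res, box_score = V1, overtime = V2)
names(res_clean)
```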