
Evening all, I'm having a few issues at the moment scraping data from multiple web pages.

library(RCurl)
library(XML)
tables <- readHTMLTable(getURL("https://www.basketball-reference.com/leagues/NBA_2018_games.html"))

 for (i in c("october", "november", "december", "january")) {
   readHTMLTable(getURL(paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-",i,".html")))
   regular <- tables[["schedule"]]
   write.csv(regular, file = paste0("./", i, i, ".csv"))
  }

I'm having an issue where it doesn't appear to be looping through the months and is just saving 4 files from october. Any help appreciated.

Jay Raúl
  • You have object `tables` stored in your environment (it's not being overwritten). You have to redirect `readHTMLTable()` output to `tables` (`tables <- readHTMLTable()`). – pogibas Jan 02 '18 at 18:23
  • Hi! Welcome, but please post a reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Icaro Bombonato Jan 02 '18 at 18:24
  • Thanks I've updated the opening post. – Jay Raúl Jan 02 '18 at 18:27
  • You should have read the first comment. After making the change it works as programmed, although the file names now look like: "januaryjanuary.csv". – IRTFM Jan 02 '18 at 18:40
  • You also need to add a timeout to make this process conform to the TOS: `you agree not to use or launch any automated system, including without limitation, robots, spiders, offline readers, or like devices, that accesses the Site in a manner which sends more request messages to the Site server in any given period of time than a typical human would normally produce in the same period by using a conventional on-line Web browser to read.` – IRTFM Jan 02 '18 at 18:42
  • Many thanks everyone, and @42- I've now added that into my code. Thanks. – Jay Raúl Jan 02 '18 at 18:54
  • can you give my answer as good @JayRaúl? thank you – Henry Navarro Jan 05 '18 at 14:02
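
For reference, a minimal corrected version of the question's loop, applying the fix from the first comment (assign the `readHTMLTable()` output to `tables` inside the loop) plus a pause between requests as suggested for the TOS, might look like this (untested sketch):

```r
library(RCurl)
library(XML)

for (i in c("october", "november", "december", "january")) {
  # Overwrite `tables` on each iteration instead of reusing a stale object
  tables <- readHTMLTable(getURL(paste0(
    "https://www.basketball-reference.com/leagues/NBA_2018_games-", i, ".html")))
  regular <- tables[["schedule"]]
  write.csv(regular, file = paste0("./", i, ".csv"))
  Sys.sleep(5) # pause between requests
}
```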

2 Answers


This is not the most elegant way, but it works well.

Hope it helps.

Code for the web scraping:

rm(list = ls())

if (!require("rvest")) { install.packages("rvest"); library("rvest") }

for (i in c("october", "november", "december", "january")) {

  nba_url <- read_html(paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-", i, ".html"))

  # Left-aligned cells of the table
  left <- nba_url %>%
    html_nodes(".left") %>%
    html_text()
  left <- left[-length(left)]
  left <- left[-(1:4)]

  # Assign specific values
  Date    <- left[seq(1, length(left), 4)]
  Visitor <- left[seq(2, length(left), 4)]
  Home    <- left[seq(3, length(left), 4)]

  # Right-aligned cells of the table
  right <- nba_url %>%
    html_nodes(".right") %>%
    html_text()
  right <- right[-length(right)]
  right <- right[-(1:2)]

  # Assign specific values
  Start <- right[seq(1, length(right), 3)]
  PTS1  <- right[seq(2, length(right), 3)]
  PTS2  <- right[seq(3, length(right), 3)]

  nba_data <- data.frame(Date, Start, Visitor, PTS1, Home, PTS2)

  write.csv(nba_data, file = paste0("./", i, ".csv"))

  Sys.sleep(5) # pause between requests, per the site's crawl delay
}
Henry Navarro
This is a solution using the tidyverse to scrape this website. But first we check the robots.txt file of the website to get a sense of the rate limit for requests. See the post Analyzing “Crawl-Delay” Settings in Common Crawl robots.txt Data with R for further info.

library(spiderbar)
library(robotstxt)
rt <- robxp(get_robotstxt("https://www.basketball-reference.com"))
crawl_delays(rt)
#>             agent crawl_delay
#> 1               *           3
#> 2       ahrefsbot          -1
#> 3      twitterbot          -1
#> 4       slysearch          -1
#> 5  ground-control          -1
#> 6   groundcontrol          -1
#> 7          matrix          -1
#> 8         hal9000          -1
#> 9         carmine          -1
#> 10     the-matrix          -1
#> 11         skynet          -1

We are interested in the `*` value. We see we have to wait a minimum of 3 seconds between requests; we will wait 5 seconds.

We use the tidyverse ecosystem to build the urls and iterate through them to get a table with all the data.

library(tidyverse)
library(rvest)
#> Loading required package: xml2
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#> 
#>     pluck
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding

month_sub <- c("october", "november", "december", "january")

urls <- map_chr(month_sub, ~ paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-", .,".html"))
urls
#> [1] "https://www.basketball-reference.com/leagues/NBA_2018_games-october.html" 
#> [2] "https://www.basketball-reference.com/leagues/NBA_2018_games-november.html"
#> [3] "https://www.basketball-reference.com/leagues/NBA_2018_games-december.html"
#> [4] "https://www.basketball-reference.com/leagues/NBA_2018_games-january.html"

pb <- progress_estimated(length(urls))
map(urls, ~{
  url <- .
  pb$tick()$print()
  Sys.sleep(5) # we take 5sec
  tables <- read_html(url) %>%
    # we select the table part by its table id tag
    html_nodes("#schedule") %>%
    # we extract the table
    html_table() %>%
    # we get a 1 element list so we take flatten to get a tibble
    flatten_df()
}) -> tables

# we get a list of tables, one per month
str(tables, 1)
#> List of 4
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    104 obs. of  8 variables:
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    213 obs. of  8 variables:
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    227 obs. of  8 variables:
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    216 obs. of  8 variables:

# We can get all the data in one table by binding rows.
# As we saw on the website, there are 2 empty columns with no names,
# so we need to take care of them with repair_names before row binding
res <- tables %>%
  map_df(tibble::repair_names)

res
#> # A tibble: 760 x 8
#>                 Date `Start (ET)`      `Visitor/Neutral`   PTS
#>                <chr>        <chr>                  <chr> <int>
#>  1 Tue, Oct 17, 2017      8:01 pm         Boston Celtics   102
#>  2 Tue, Oct 17, 2017     10:30 pm        Houston Rockets   121
#>  3 Wed, Oct 18, 2017      7:30 pm        Milwaukee Bucks   100
#>  4 Wed, Oct 18, 2017      8:30 pm          Atlanta Hawks   111
#>  5 Wed, Oct 18, 2017      7:00 pm      Charlotte Hornets   102
#>  6 Wed, Oct 18, 2017      7:00 pm          Brooklyn Nets   140
#>  7 Wed, Oct 18, 2017      8:00 pm   New Orleans Pelicans   103
#>  8 Wed, Oct 18, 2017      7:00 pm             Miami Heat   116
#>  9 Wed, Oct 18, 2017     10:00 pm Portland Trail Blazers    76
#> 10 Wed, Oct 18, 2017     10:00 pm        Houston Rockets   100
#> # ... with 750 more rows, and 4 more variables: `Home/Neutral` <chr>,
#> #   V1 <chr>, V2 <chr>, Notes <lgl>
cderv