I am very new to R and the rvest package, and I am trying to extract data from multiple tables across multiple pages.
One example is the box score of each game here:
https://www.pro-football-reference.com/boxscores/201309050den.htm
I tried the following to get the data from one table:
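library(rvest)

# Read the box score page
webpage <- read_html("https://www.pro-football-reference.com/boxscores/201309050den.htm")

# List every <table> node found on the page
tbls <- html_nodes(webpage, "table")
head(tbls)

# Keep only the third table and convert it to a data frame
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[3] %>%
  html_table(fill = TRUE)
str(tbls_ls)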
This returns:
List of 1
$ :'data.frame': 22 obs. of 22 variables:
..$ : chr [1:22] "Player" "Joe Flacco" "Ray Rice" "Bernard Pierce" ...
..$ : chr [1:22] "Tm" "BAL" "BAL" "BAL" ...
..$ Passing : chr [1:22] "Cmp" "34" "0" "0" ...
..$ Passing : chr [1:22] "Att" "62" "0" "0" ...
..$ Passing : chr [1:22] "Yds" "362" "0" "0" ...
..$ Passing : chr [1:22] "TD" "2" "0" "0" ...
..$ Passing : chr [1:22] "Int" "2" "0" "0" ...
..$ Passing : chr [1:22] "Sk" "4" "0" "0" ...
..$ Passing : chr [1:22] "Yds" "27" "0" "0" ...
..$ Passing : chr [1:22] "Lng" "34" "0" "0" ...
..$ Passing : chr [1:22] "Rate" "69.4" "" "" ...
..$ Rushing : chr [1:22] "Att" "0" "12" "9" ...
..$ Rushing : chr [1:22] "Yds" "0" "36" "22" ...
..$ Rushing : chr [1:22] "TD" "0" "1" "0" ...
..$ Rushing : chr [1:22] "Lng" "0" "12" "14" ...
..$ Receiving: chr [1:22] "Tgt" "0" "11" "1" ...
..$ Receiving: chr [1:22] "Rec" "0" "8" "0" ...
..$ Receiving: chr [1:22] "Yds" "0" "35" "0" ...
..$ Receiving: chr [1:22] "TD" "0" "0" "0" ...
..$ Receiving: chr [1:22] "Lng" "0" "10" "0" ...
..$ Fumbles : chr [1:22] "Fmb" "1" "0" "0" ...
..$ Fumbles : chr [1:22] "FL" "0" "0" "0" ...
But this is only one table from one game.
I am trying to go through the pages for every box score, for each week of each year.
All of the pages begin with this part of the URL:
https://www.pro-football-reference.com/boxscores/
But then I need to loop through all the dates in the year, so for example:
201309050
201309080
and team:
den
buf
(which would be all 32 teams in the NFL)
The two examples above would go to these two URLs:
https://www.pro-football-reference.com/boxscores/201309050den.htm
https://www.pro-football-reference.com/boxscores/201309080buf.htm
If I have a vector of dates and a vector of teams, is there a way to loop through each of these to check each combination and return the info from the table on each page?
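A minimal sketch of the "every combination" idea, assuming the dates are held as character strings in YYYYMMDD form and that every box score URL carries the same trailing "0" after the date as the two examples above. The two short vectors are placeholders; a real run would use all dates and all 32 team codes.

```r
# Placeholder inputs taken from the two example URLs.
dates <- c("20130905", "20130908")
teams <- c("den", "buf")

base_url <- "https://www.pro-football-reference.com/boxscores/"

# expand.grid() builds one row per date/team combination.
combos <- expand.grid(date = dates, team = teams, stringsAsFactors = FALSE)

# Assumption: the "0" after the date is constant, as in the example URLs.
combos$url <- paste0(base_url, combos$date, "0", combos$team, ".htm")

print(combos$url)
```

Most of these combinations will not correspond to a real game, so whatever reads the pages needs to tolerate URLs that return nothing.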
Or could I use a start date and end date and somehow go through each date in the range with each team name?
Start date would be
20130901
End date would be
20140301
(for season 2013). There would be more seasons to go through, ideally 2010 - 2019.
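The date-range idea could be sketched with base R's seq() on Date objects. The helper name season_dates is hypothetical, and the 1 September to 1 March window is simply the one given above, assumed to apply to every season.

```r
# Hypothetical helper: every calendar date from 1 September of `year`
# to 1 March of the following year, formatted as YYYYMMDD strings.
season_dates <- function(year) {
  start_date <- as.Date(sprintf("%d0901", year), format = "%Y%m%d")
  end_date <- as.Date(sprintf("%d0301", year + 1), format = "%Y%m%d")
  format(seq(start_date, end_date, by = "day"), "%Y%m%d")
}

dates_2013 <- season_dates(2013)
head(dates_2013)

# The same helper applied to seasons 2010 to 2019, one vector per season.
all_seasons <- lapply(2010:2019, season_dates)
names(all_seasons) <- 2010:2019
```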
Ideally I would like to loop through each date in a year AND each team and if a record is returned, I would like to add them all to one table, like so:
Year Week Player Team Cmp Att Yds TD Int Sk Yds Lng Rate Att Yds TD Lng Tgt Rec Yds TD Lng Fmb FL
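A rough sketch of the "add them all to one table" step, under several assumptions: each page's table comes back with the real column names in its first row (as in the str() output above), all tables share the same columns, and a failed read should simply be skipped. The names promote_header, collect_tables, and fake_fetch are hypothetical. In practice the fetch argument would be a function wrapping the read_html / html_nodes / html_table call, and fake_fetch only stands in for it so the sketch runs offline.

```r
# Move the first row into the column names. make.unique() is needed
# because labels such as "Yds" and "TD" repeat across stat groups.
promote_header <- function(df) {
  header <- make.unique(as.character(unlist(df[1, ])))
  out <- df[-1, , drop = FALSE]
  names(out) <- header
  rownames(out) <- NULL
  out
}

# Try every URL, skip the ones where `fetch` raises an error,
# and stack the remaining tables row-wise.
collect_tables <- function(urls, fetch) {
  pieces <- lapply(urls, function(u) {
    tbl <- tryCatch(fetch(u), error = function(e) NULL)
    if (is.null(tbl)) return(NULL)
    out <- promote_header(tbl)
    out$URL <- rep(u, nrow(out))  # record which page each row came from
    out
  })
  pieces <- Filter(Negate(is.null), pieces)
  do.call(rbind, pieces)
}

# Offline stand-in for the real scraper: fails for one URL on purpose.
fake_fetch <- function(u) {
  if (grepl("buf", u)) stop("no such page")
  data.frame(X1 = c("Player", "Joe Flacco"),
             X2 = c("Tm", "BAL"),
             stringsAsFactors = FALSE)
}

urls <- c("https://www.pro-football-reference.com/boxscores/201309050den.htm",
          "https://www.pro-football-reference.com/boxscores/201309080buf.htm")
combined <- collect_tables(urls, fake_fetch)
print(combined)
```

The Year could then be taken from the date portion of the URL column. Week is not present in the table itself, so it would have to come from somewhere else.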
It would be good to only return the records for each quarterback, although I am not sure how this can be achieved.
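The str() output above shows no position column, so one rough proxy for "quarterback" might be "any player with at least one pass attempt". This is an assumption rather than a true position filter: a non-quarterback who throws a pass on a trick play would also be kept. The small data frame below is a placeholder built from the first rows of the output above, with Att standing for passing attempts.

```r
# Placeholder rows; the values are read as text, as html_table() returned them.
stats <- data.frame(
  Player = c("Joe Flacco", "Ray Rice", "Bernard Pierce"),
  Tm = c("BAL", "BAL", "BAL"),
  Att = c("62", "0", "0"),
  stringsAsFactors = FALSE
)

# Convert the text column to numbers before comparing.
stats$Att <- as.numeric(stats$Att)

# Keep only players with at least one pass attempt (proxy for quarterbacks).
passers <- stats[!is.na(stats$Att) & stats$Att > 0, ]
print(passers)
```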