
I am very new to R and the rvest package, and I am trying to extract data from multiple tables across multiple pages.

One example is the box score of each game here:

https://www.pro-football-reference.com/boxscores/201309050den.htm

I tried the following to get the data from one table:

library(rvest)

webpage <- read_html("https://www.pro-football-reference.com/boxscores/201309050den.htm")

tbls <- html_nodes(webpage, "table")

head(tbls)


tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[3] %>%  # keep only the third table on the page
  html_table(fill = TRUE)

str(tbls_ls)

This returns:

List of 1
    $ :'data.frame':      22 obs. of  22 variables:
      ..$          : chr [1:22] "Player" "Joe Flacco" "Ray Rice" "Bernard Pierce" ...
      ..$          : chr [1:22] "Tm" "BAL" "BAL" "BAL" ...
      ..$ Passing  : chr [1:22] "Cmp" "34" "0" "0" ...
      ..$ Passing  : chr [1:22] "Att" "62" "0" "0" ...
      ..$ Passing  : chr [1:22] "Yds" "362" "0" "0" ...
      ..$ Passing  : chr [1:22] "TD" "2" "0" "0" ...
      ..$ Passing  : chr [1:22] "Int" "2" "0" "0" ...
      ..$ Passing  : chr [1:22] "Sk" "4" "0" "0" ...
      ..$ Passing  : chr [1:22] "Yds" "27" "0" "0" ...
      ..$ Passing  : chr [1:22] "Lng" "34" "0" "0" ...
      ..$ Passing  : chr [1:22] "Rate" "69.4" "" "" ...
      ..$ Rushing  : chr [1:22] "Att" "0" "12" "9" ...
      ..$ Rushing  : chr [1:22] "Yds" "0" "36" "22" ...
      ..$ Rushing  : chr [1:22] "TD" "0" "1" "0" ...
      ..$ Rushing  : chr [1:22] "Lng" "0" "12" "14" ...
      ..$ Receiving: chr [1:22] "Tgt" "0" "11" "1" ...
      ..$ Receiving: chr [1:22] "Rec" "0" "8" "0" ...
      ..$ Receiving: chr [1:22] "Yds" "0" "35" "0" ...
      ..$ Receiving: chr [1:22] "TD" "0" "0" "0" ...
      ..$ Receiving: chr [1:22] "Lng" "0" "10" "0" ...
      ..$ Fumbles  : chr [1:22] "Fmb" "1" "0" "0" ...
      ..$ Fumbles  : chr [1:22] "FL" "0" "0" "0" ...

But this is only one table from one game.

I am trying to go through the box score page for every game, each week, over several years.

All of the pages begin with this part of the URL:

https://www.pro-football-reference.com/boxscores/

But then I need to loop through all the dates in the year, so for example:

201309050
201309080

and team:

den
buf

(which would be all 32 teams in the NFL)

The two examples above would go to these two URLs:

https://www.pro-football-reference.com/boxscores/201309050den.htm
https://www.pro-football-reference.com/boxscores/201309080buf.htm

If I have a vector of dates and a vector of teams, is there a way to loop through each of these to check each combination and return the info from the table on each page?
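
As a rough illustration of the "every combination" idea, base R's `expand.grid()` can pair each date with each team, and `paste0()` can assemble the URLs. The two short vectors here are just the sample values from above; a full run would presumably use all dates and all 32 team codes.

```r
# Base URL and sample values taken from the question.
base_url <- "https://www.pro-football-reference.com/boxscores/"
dates <- c("201309050", "201309080")
teams <- c("den", "buf")

# expand.grid() returns one row per date/team combination.
combos <- expand.grid(date = dates, team = teams, stringsAsFactors = FALSE)
combos$url <- paste0(base_url, combos$date, combos$team, ".htm")

print(combos$url)
```

Most combinations will probably not correspond to a real page, so whatever reads these URLs has to tolerate failed requests.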

Or could I use a start date and end date and somehow go through each date in the range with each team name?

Start date would be

20130901

End date would be

20140301

(for season 2013). There would be more seasons to go through, ideally 2010 - 2019.
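
For the start/end date variant, a possible sketch is to build the daily sequence with `seq()` on `Date` objects and format it back to the compact form. The trailing "0" is an assumption inferred from the two example IDs above (201309050, 201309080); whether it is always "0" is not confirmed.

```r
# Season boundaries from the question, parsed as Date objects.
start_date <- as.Date("20130901", format = "%Y%m%d")
end_date   <- as.Date("20140301", format = "%Y%m%d")

# One entry per calendar day in the range (inclusive on both ends).
all_days <- seq(start_date, end_date, by = "day")

# Assumption: page IDs are the date followed by a literal "0".
date_ids <- paste0(format(all_days, "%Y%m%d"), "0")

print(head(date_ids, 3))
print(length(date_ids))
```

The same steps could be wrapped in a function and applied to each season from 2010 to 2019.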

Ideally I would like to loop through each date in a year AND each team and if a record is returned, I would like to add them all to one table, like so:

Year   Week   Player  Team    Cmp   Att   Yds   TD   Int   Sk   Yds   Lng  Rate   Att   Yds   TD   Lng   Tgt   Rec   Yds   TD   Lng   Fmb   FL
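
The "try every URL, keep what comes back, stack it into one table" step could look roughly like the sketch below. Both `fetch_third_table` and `collect_tables` are hypothetical helpers written for illustration; only `read_html`, `html_nodes` and `html_table` are actual rvest functions. The sketch assumes every table that comes back has the same, unique column names, so in practice the raw tables would need their headers cleaned before `rbind()`.

```r
# Hypothetical helper: fetch one page with rvest and return its third table,
# mirroring the single-page code in the question. Needs rvest only when called.
fetch_third_table <- function(url) {
  page <- rvest::read_html(url)
  tbls <- rvest::html_table(rvest::html_nodes(page, "table"), fill = TRUE)
  tbls[[3]]  # errors if the page has fewer than three tables
}

# Hypothetical helper: try every URL, skip the ones that fail, stack the rest.
# Assumes all returned tables share the same, unique column names.
collect_tables <- function(urls, fetch = fetch_third_table, pause = 1) {
  pieces <- lapply(urls, function(u) {
    Sys.sleep(pause)  # wait between requests to go easy on the server
    tbl <- tryCatch(fetch(u), error = function(e) NULL)
    if (is.null(tbl) || nrow(tbl) == 0) return(NULL)
    tbl <- as.data.frame(tbl, stringsAsFactors = FALSE)
    tbl$url <- u  # remember which page each row came from
    tbl
  })
  pieces <- Filter(Negate(is.null), pieces)
  if (length(pieces) == 0) return(data.frame())
  do.call(rbind, pieces)
}

# Example call (performs real HTTP requests, so it is left commented out):
# all_games <- collect_tables(c(
#   "https://www.pro-football-reference.com/boxscores/201309050den.htm",
#   "https://www.pro-football-reference.com/boxscores/201309080buf.htm"
# ))
```

Passing the fetch step in as an argument keeps the loop independent of the site, which also makes it easy to try out with a stand-in function before hitting the real pages.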

It would be good to only return the records for each quarterback, although I am not sure how this can be achieved.
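
The table itself has no position column, so one rough heuristic for the quarterback rows is to keep players with at least one pass attempt. This is only an approximation, since other positions occasionally throw a pass. The sketch assumes the header has already been cleaned so that a `Passing_Att` column exists (a hypothetical name); the values are copied from the output above.

```r
# Toy rows with values from the question; Passing_Att is a hypothetical,
# already-cleaned column name.
box <- data.frame(
  Player = c("Joe Flacco", "Ray Rice", "Bernard Pierce"),
  Tm = c("BAL", "BAL", "BAL"),
  Passing_Att = c("62", "0", "0"),
  stringsAsFactors = FALSE
)

# html_table() returned character columns here, so convert before comparing.
box$Passing_Att <- as.numeric(box$Passing_Att)

# Heuristic: treat anyone with a pass attempt as a passer.
passers <- box[!is.na(box$Passing_Att) & box$Passing_Att > 0, ]
print(passers)
```

A stricter threshold on attempts would filter out most trick-play passes, at the cost of possibly dropping a quarterback who barely played.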

Michael
  • You can get all the tables in a list with `lapply(tbls, html_table, fill=TRUE)` – MrFlick Aug 31 '20 at 04:56
  • does this give me all the tables across multiple pages? I assume it only works from one page? – Michael Aug 31 '20 at 05:00
  • What other pages do you need? Do you have a vector of URLs? You can `lapply` over those values as well – MrFlick Aug 31 '20 at 05:01
  • the above is only one box score for one game. I would need multiple pages for each game each week. I am unsure how to use the vector of URLs? – Michael Aug 31 '20 at 05:05
  • Maybe this can help with that part: https://stackoverflow.com/questions/36683510/r-web-scraping-across-multiple-pages or this: https://stackoverflow.com/questions/42789654/scraping-data-from-multiple-pages-in-r-using-rvest. It would help to edit the question to make it clear what the programming question is. If this question is just about how data is stored on an external website, that's not really on-topic. – MrFlick Aug 31 '20 at 05:16
  • I will take a look at those. The programming question is how to obtain the data from each table across multiple pages, I am not sure how much clearer I can make it. Is there a generic way I can loop through each date and each team? – Michael Aug 31 '20 at 05:21
  • But you only give one page in your question. So updating that would make it more clear. And also show exactly what your expected output would be. How do you want to collect all these tables? – MrFlick Aug 31 '20 at 05:22
  • ok edited above. Hope this is clearer. – Michael Aug 31 '20 at 05:40
  • If you check your `tbls_ls` the data is not properly extracted. Column names are in 1st row. Also 2 tables are combined into one (see row 12). – Ronak Shah Sep 01 '20 at 03:38
