I am new to R and have a question about the most efficient way to construct a database. I would like to build a database of NFL statistics. These statistics are readily available on the web at a number of locations, but I've found the most thorough analysis to be on Pro-Football-Reference (http://www.pro-football-reference.com/). This will be panel data where the time intervals are each week of each season, my observations are each player in each game, and my columns are the statistics tallied in all of the tables of Pro-Football-Reference's boxscores (http://www.pro-football-reference.com/boxscores/201702050atl.htm).
I could scrape each table of each game of each season with something like:
# PACKAGES
library(rvest)
library(XML)

url.201702050atl <- "http://www.pro-football-reference.com/boxscores/201702050atl.htm"
page.201702050atl <- read_html(url.201702050atl)

# The scorebox and scoring summary sit in the visible HTML:
scorebox.201702050atl <- readHTMLTable(url.201702050atl, which = 1)
scoring.201702050atl  <- readHTMLTable(url.201702050atl, which = 2)

# Every other table is commented out in the page source, so I collect the
# comment nodes and index into them by position:
comments.201702050atl <- page.201702050atl %>% html_nodes(xpath = "//comment()")

game.info.201702050atl          <- comments.201702050atl[17] %>% html_text() %>% read_html() %>% html_node("#game_info") %>% html_table()
officials.201702050atl          <- comments.201702050atl[21] %>% html_text() %>% read_html() %>% html_node("#officials") %>% html_table()
expected.points.201702050atl    <- comments.201702050atl[22] %>% html_text() %>% read_html() %>% html_node("#expected_points") %>% html_table()
team.stats.201702050atl         <- comments.201702050atl[27] %>% html_text() %>% read_html() %>% html_node("#team_stats") %>% html_table()
player.offense.201702050atl     <- comments.201702050atl[31] %>% html_text() %>% read_html() %>% html_node("#player_offense") %>% html_table()
player.defense.201702050atl     <- comments.201702050atl[32] %>% html_text() %>% read_html() %>% html_node("#player_defense") %>% html_table()
returns.201702050atl            <- comments.201702050atl[33] %>% html_text() %>% read_html() %>% html_node("#returns") %>% html_table()
kicking.201702050atl            <- comments.201702050atl[34] %>% html_text() %>% read_html() %>% html_node("#kicking") %>% html_table()
home.starters.201702050atl      <- comments.201702050atl[35] %>% html_text() %>% read_html() %>% html_node("#home_starters") %>% html_table()
vis.starters.201702050atl       <- comments.201702050atl[36] %>% html_text() %>% read_html() %>% html_node("#vis_starters") %>% html_table()
home.snap.counts.201702050atl   <- comments.201702050atl[37] %>% html_text() %>% read_html() %>% html_node("#home_snap_counts") %>% html_table()
vis.snap.counts.201702050atl    <- comments.201702050atl[38] %>% html_text() %>% read_html() %>% html_node("#vis_snap_counts") %>% html_table()
targets.directions.201702050atl <- comments.201702050atl[39] %>% html_text() %>% read_html() %>% html_node("#targets_directions") %>% html_table()
rush.directions.201702050atl    <- comments.201702050atl[40] %>% html_text() %>% read_html() %>% html_node("#rush_directions") %>% html_table()
pass.tackles.201702050atl       <- comments.201702050atl[41] %>% html_text() %>% read_html() %>% html_node("#pass_tackles") %>% html_table()
rush.tackles.201702050atl       <- comments.201702050atl[42] %>% html_text() %>% read_html() %>% html_node("#rush_tackles") %>% html_table()
home.drives.201702050atl        <- comments.201702050atl[43] %>% html_text() %>% read_html() %>% html_node("#home_drives") %>% html_table()
vis.drives.201702050atl         <- comments.201702050atl[44] %>% html_text() %>% read_html() %>% html_node("#vis_drives") %>% html_table()
pbp.201702050atl                <- comments.201702050atl[45] %>% html_text() %>% read_html() %>% html_node("#pbp") %>% html_table()
However, the number of lines of code needed to scrape and clean each table, multiplied across 256 games per season, suggests a more efficient method might exist.
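For instance, I suspect the per-table calls above could be collapsed into one helper that looks each table up by its CSS id instead of a hand-counted comment index (the indices presumably shift between pages, while the ids look stable). This is only a sketch; `table_from_comments` is my own name, not anything from rvest:

```r
library(rvest)

# Return the first table whose id matches, searching every commented-out
# chunk of the page; NULL if no comment contains such a table.
table_from_comments <- function(page, id) {
  comments <- page %>% html_nodes(xpath = "//comment()")
  for (i in seq_along(comments)) {
    node <- comments[[i]] %>% html_text() %>% read_html() %>%
      html_node(paste0("#", id))
    if (!inherits(node, "xml_missing")) return(html_table(node))
  }
  NULL
}
```

If that works, each extraction becomes one call, e.g. `team.stats.201702050atl <- table_from_comments(page.201702050atl, "team_stats")`, and the table names could be looped over for every game.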
The NFL officially records stats in its game books (http://www.nfl.com/liveupdate/gamecenter/57167/ATL_Gamebook.pdf). Since sites like Pro-Football-Reference include stats not tallied in the official game books, and since the identifying language needed to derive them appears in the game books' Play-by-Play, I deduce those sites run a function that parses the Play-by-Play and tallies the statistics. New as I am, I've never written a function or parsed anything in R before; but I imagine one function applied to every game book would be more efficient than scraping each individual table. Am I on the right path here? I'd hate to invest a ton of effort in the wrong direction.
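For what it's worth, here is the kind of tallying function I imagine, sketched in base R against a made-up play description in Pro-Football-Reference's style. The description format, field names, and regex are my guesses about how such a parser might work, not anything from their actual code:

```r
# Tally one completed pass from a play description like
# "M.Ryan pass short left to J.Jones for 19 yards"
parse_pass_play <- function(desc) {
  m <- regmatches(desc, regexec(
    "^([A-Z]\\.[A-Za-z'-]+) pass .* to ([A-Z]\\.[A-Za-z'-]+) for (-?\\d+) yards",
    desc))[[1]]
  if (length(m) == 0) return(NULL)   # not a completed pass
  data.frame(passer   = m[2],
             receiver = m[3],
             yards    = as.integer(m[4]),
             stringsAsFactors = FALSE)
}
```

Presumably a real version would need one such rule per play type (runs, sacks, penalties, and so on), with the results row-bound into the panel.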
An additional problem arises because the game books are PDFs. Play-by-Plays exist on other websites in table format, but none are as complete. I've read some excellent tutorials on this site about how to convert a PDF into text using
library(tm)
But, I've not yet figured it out for my own purposes.
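From what I've read so far, the step might look something like the sketch below. I'm assuming the pdftools package (which I've seen suggested alongside tm), a hypothetical local copy of a game book, and a guessed section-header string; none of this is tested against a real game book:

```r
# Extract the PDF as one string per page, then split into lines.
pdf_lines <- function(path) {
  pages <- pdftools::pdf_text(path)
  unlist(strsplit(pages, "\n"))
}

# Slice out everything after the Play-by-Play header (the marker text
# is my guess at what the game books print).
pbp_section <- function(lines, start_marker = "Play By Play") {
  start <- grep(start_marker, lines)[1]
  if (is.na(start) || start >= length(lines)) return(character(0))
  lines[(start + 1):length(lines)]
}
```

The remaining lines would then feed into whatever tallying function parses each play.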
Once I convert the entire PDF to text, do I simply identify the Play-by-Play portion, parse it out, and from there parse out each statistic? Are there additional obstacles my limited experience has prevented me from foreseeing?
This might be too "beginner" a question for this site; but could anyone set me on the right path here, or point me to a resource that could? Thanks so much for the help.