I am new to R and have a question about the most efficient way to construct a database. I would like to build a database of NFL statistics. These statistics are readily available on the web at a number of locations, but I've found the most thorough analysis to be on Pro-Football-Reference (http://www.pro-football-reference.com/). This will be panel data where the time intervals are each week of each season, my observations are each player in each game, and my columns are the statistics tallied in all of the tables of Pro-Football-Reference's boxscores (http://www.pro-football-reference.com/boxscores/201702050atl.htm).
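
To make the intended structure concrete, the panel would look something like the sketch below (the column names are just placeholders I'm using for illustration):

# Illustrative only: the intended panel layout, one row per player per game
panel <- data.frame(
  season   = integer(),     # season year
  week     = integer(),     # week of the season (the time interval)
  game.id  = character(),   # Pro-Football-Reference game ID, e.g. "201702050atl"
  player   = character(),   # the observation: one row per player per game
  team     = character(),
  pass.yds = numeric(),     # ...one column per statistic from the boxscore tables
  rush.yds = numeric(),
  stringsAsFactors = FALSE
)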

I could scrape each table of each game of each season with something like:

#PACKAGES
library(rvest)
library(XML)
page.201702050atl = read_html("http://www.pro-football-reference.com/boxscores/201702050atl.htm")
comments.201702050atl = page.201702050atl %>% html_nodes(xpath = "//comment()")
scoring.201702050atl = readHTMLTable("http://www.pro-football-reference.com/boxscores/201702050atl.htm", which = 2)
game.info.201702050atl = comments.201702050atl[17] %>% html_text() %>% read_html() %>% html_node("#game_info") %>% html_table()
officials.201702050atl = comments.201702050atl[21] %>% html_text() %>% read_html() %>% html_node("#officials") %>% html_table()
team.stats.201702050atl = comments.201702050atl[27] %>% html_text() %>% read_html() %>% html_node("#team_stats") %>% html_table()
scorebox.201702050atl = readHTMLTable("http://www.pro-football-reference.com/boxscores/201702050atl.htm", which = 1)
expected.points.201702050atl = comments.201702050atl[22] %>% html_text() %>% read_html() %>% html_node("#expected_points") %>% html_table()
player.offense.201702050atl = comments.201702050atl[31] %>% html_text() %>% read_html() %>% html_node("#player_offense") %>% html_table()
player.defense.201702050atl = comments.201702050atl[32] %>% html_text() %>% read_html() %>% html_node("#player_defense") %>% html_table()
returns.201702050atl = comments.201702050atl[33] %>% html_text() %>% read_html() %>% html_node("#returns") %>% html_table()
kicking.201702050atl = comments.201702050atl[34] %>% html_text() %>% read_html() %>% html_node("#kicking") %>% html_table()
home.starters.201702050atl = comments.201702050atl[35] %>% html_text() %>% read_html() %>% html_node("#home_starters") %>% html_table()
vis.starters.201702050atl = comments.201702050atl[36] %>% html_text() %>% read_html() %>% html_node("#vis_starters") %>% html_table()
home.snap.counts.201702050atl = comments.201702050atl[37] %>% html_text() %>% read_html() %>% html_node("#home_snap_counts") %>% html_table()
vis.snap.counts.201702050atl = comments.201702050atl[38] %>% html_text() %>% read_html() %>% html_node("#vis_snap_counts") %>% html_table()
targets.directions.201702050atl = comments.201702050atl[39] %>% html_text() %>% read_html() %>% html_node("#targets_directions") %>% html_table()
rush.directions.201702050atl = comments.201702050atl[40] %>% html_text() %>% read_html() %>% html_node("#rush_directions") %>% html_table()
pass.tackles.201702050atl = comments.201702050atl[41] %>% html_text() %>% read_html() %>% html_node("#pass_tackles") %>% html_table()
rush.tackles.201702050atl = comments.201702050atl[42] %>% html_text() %>% read_html() %>% html_node("#rush_tackles") %>% html_table()
home.drives.201702050atl = comments.201702050atl[43] %>% html_text() %>% read_html() %>% html_node("#home_drives") %>% html_table()
vis.drives.201702050atl = comments.201702050atl[44] %>% html_text() %>% read_html() %>% html_node("#vis_drives") %>% html_table()
pbp.201702050atl = comments.201702050atl[45] %>% html_text() %>% read_html() %>% html_node("#pbp") %>% html_table()

However, the amount of code needed to clean up each scraped table, multiplied across 256 games each year, suggests a more efficient method might exist.

The NFL officially records stats in their game books (http://www.nfl.com/liveupdate/gamecenter/57167/ATL_Gamebook.pdf). Since sites like Pro-Football-Reference include stats not tallied in the official game books, and since the identifying language needed to derive them is included in the game books' Play-by-Play, I deduce they are running a function to parse the Play-by-Play and tally those statistics. New as I am, I've never written a function or parsed anything in R before, but I'd imagine one function I can apply to every game book would be a more efficient method than scraping each individual table. Am I on the right path here? I'd hate to invest a ton of effort in the wrong direction.

An additional problem arises because the game books are PDFs. Play-by-Plays exist on other websites in table format, but none are as complete. I've read some excellent tutorials on this site about how to convert a PDF into text using

library(tm)

But I've not yet figured it out for my own purposes.
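
From those tutorials, my rough understanding of that step is something like the sketch below; it assumes the game book PDF has already been downloaded locally and that the pdftools package is installed for text extraction (the filename is just a placeholder).

library(tm)

# Sketch only: read one locally downloaded game book into plain text.
# Assumes "ATL_Gamebook.pdf" is in the working directory and that the
# pdftools package is installed.
gamebook <- VCorpus(URISource("ATL_Gamebook.pdf", mode = ""),
                    readerControl = list(reader = readPDF(engine = "pdftools")))

gamebook.text <- content(gamebook[[1]])  # extracted text, roughly one element per page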

Once I convert the entire PDF to text, do I simply identify the Play-by-Play portion, parse it out, and from there parse out each statistic? Are there additional obstacles my limited experience has prevented me from foreseeing?
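
For instance, my rough mental model of that last parsing step, using a made-up play description and deliberately naive patterns (the regular expressions are only illustrative):

# Illustrative only: a made-up play description and naive regular expressions
# to pull out the passer and the yards gained. Real game book text would need
# far more robust patterns and handling for many play types.
play   <- "M.Ryan pass short right to J.Jones for 19 yards"
passer <- sub("^(\\S+) pass.*$", "\\1", play)                      # "M.Ryan"
yards  <- as.numeric(sub("^.* for (\\d+) yards.*$", "\\1", play))  # 19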

This might be too "beginner" of a question for this website, but could anyone set me up here? Or provide me with a resource that could? Thanks so much for the help.

DataProphets
  • If this isn't needed in real time, I would download the .pdfs, collate them in a corpus using tm or another package, and then extract your data. A word of caution: pulling the PDFs may be a violation of their TOS. There are also a variety of APIs available that do something similar already. – Phi Jun 10 '17 at 18:34

1 Answer

Consider generalizing your one-game parsing to all games by storing the HTML tables in a growing list for all 256 games. Below is an example for Week 1.

# PACKAGES
library(rvest)
library(XML)

# PARSE WEEK 1 SCHEDULE PAGE
doc <- htmlParse(readLines("http://www.pro-football-reference.com/years/2016/week_1.htm"))

# EXTRACT ALL GAME PAGES
games <- xpathSApply(doc, "//a[text()='Final']/@href")

# FUNCTION TO BUILD HTML TABLE LIST
getwebdata <- function(url) {
    print(url)
    boxscoreurl <- paste0("http://www.pro-football-reference.com", url)
    page <- read_html(boxscoreurl)
    comments <- page %>% html_nodes(xpath = "//comment()")

    list(
      scoring = readHTMLTable(boxscoreurl, which = 2),
      game.info = comments[17] %>% html_text() %>% read_html() %>% html_node("#game_info") %>% html_table(),
      officials = comments[21] %>% html_text() %>% read_html() %>% html_node("#officials") %>% html_table(),
      team.stats = comments[27] %>% html_text() %>% read_html() %>% html_node("#team_stats") %>% html_table(),
      scorebox = readHTMLTable(boxscoreurl, which = 1),
      expected.points = comments[22] %>% html_text() %>% read_html() %>% html_node("#expected_points") %>% html_table(),
      player.offense = comments[31] %>% html_text() %>% read_html() %>% html_node("#player_offense") %>% html_table(),
      player.defense = comments[32] %>% html_text() %>% read_html() %>% html_node("#player_defense") %>% html_table(),
      returns = comments[33] %>% html_text() %>% read_html() %>% html_node("#returns") %>% html_table(),
      kicking = comments[34] %>% html_text() %>% read_html() %>% html_node("#kicking") %>% html_table(),
      home.starters = comments[35] %>% html_text() %>% read_html() %>% html_node("#home_starters") %>% html_table(),
      vis.starters = comments[36] %>% html_text() %>% read_html() %>% html_node("#vis_starters") %>% html_table(),
      home.snap.counts = comments[37] %>% html_text() %>% read_html() %>% html_node("#home_snap_counts") %>% html_table(),
      vis.snap.counts = comments[38] %>% html_text() %>% read_html() %>% html_node("#vis_snap_counts") %>% html_table(),
      targets.directions = comments[39] %>% html_text() %>% read_html() %>% html_node("#targets_directions") %>% html_table(),
      rush.directions = comments[40] %>% html_text() %>% read_html() %>% html_node("#rush_directions") %>% html_table(),
      pass.tackles = comments[41] %>% html_text() %>% read_html() %>% html_node("#pass_tackles") %>% html_table(),
      rush.tackles = comments[42] %>% html_text() %>% read_html() %>% html_node("#rush_tackles") %>% html_table(),
      home.drives = comments[43] %>% html_text() %>% read_html() %>% html_node("#home_drives") %>% html_table(),
      vis.drives = comments[44] %>% html_text() %>% read_html() %>% html_node("#vis_drives") %>% html_table(),
      pbp = comments[45] %>% html_text() %>% read_html() %>% html_node("#pbp") %>% html_table()
    )
}

# ALL WEEK ONE LIST OF HTML TABLE(S) DATA
week1datalist <- lapply(games, getwebdata)

# TRY/CATCH VERSION (ANY PARSING ERROR RETURNS AN EMPTY LIST)
week1datalist <- lapply(games, function(g) {
    tryCatch(getwebdata(g), error = function(e) list())
})

# NAME EACH LIST ELEMENT BY CORRESPONDING GAME
shortgames <- gsub("/", "", gsub(".htm", "", games))
week1datalist <- setNames(week1datalist, shortgames)

Ultimately, you can then reference one game's specific stats table by name:

week1datalist$boxscores201609080den$scoring

week1datalist$boxscores201609110atl$game.info

Also, you might need to include tryCatch inside lapply, as in the version above, since some pages may not be consistent.

Parfait
  • Thank you, Parfait, for your thoughtful response! It looks like this will ultimately do exactly what I'm looking for; however, I'm running into an error. It looks like there's an issue with the function when I go to create the week one list of HTML tables: `week1datalist <- lapply(games, getwebdata)` returns `Error in FUN(X[[i]], ...) : object 'comments' not found`. Any idea what's going on? – DataProphets Jun 13 '17 at 16:44
  • Whoops, simply change `comments <` (less than symbol) to `comments <-` (R assignment operator) which is a change from your equal signs. – Parfait Jun 13 '17 at 16:52
  • Nevermind, that was just a typo and I found it. Now, I'm getting this error: Error: ' div#akunit_mobile_1 ' does not exist in current working directory ('C:/Users/Owner/Documents'). Is this an instance where I'll need to include tryCatch? – DataProphets Jun 13 '17 at 17:00
  • At the very top of function add `print(url)`. So when method is being looped you know which page hits the error. Is it the first game in *games* list? I also add a `tryCatch` inside `lapply` to return empty lists on any error. – Parfait Jun 13 '17 at 17:09
  • `week1datalist <- lapply(games, getwebdata)` prints `[1] "/boxscores/201609080den.htm"` and then errors with `Error: ' div#akunit_mobile_1 ' does not exist in current working directory ('C:/Users/Owner/Documents')`. I think the answer to your question is yes; it looks like the issue is with the Denver game, and the Denver-Carolina game is the first one on the page. – DataProphets Jun 13 '17 at 17:15
  • Are you sure your Atlanta game parsing works with above code? I assumed it does. – Parfait Jun 13 '17 at 17:43
  • You're right, I forgot I was using an old code and the page had since updated. Each bracketed number after "comments" should be 2 integers lower. I made this correction and the code works perfectly! I do have one additional question, however. The next thing I need to do is change the formatting of all these tables... Is there a way to incorporate that into the above protocol, so as to avoid calling each specific table to format? My uneducated attempts of incorporating it into getwebdata function have, so far, failed. – DataProphets Jul 08 '17 at 07:21
  • Great to hear from you! Usually when OPs ghost, you never hear back from them. Glad I could help. Please accept the solution (tick to the side) if it worked, to confirm resolution. – Parfait Jul 08 '17 at 13:47
  • As for the other item, the output here is not a set of tables but a list of many elements, where only two, *scorebox* and *scoring*, were HTML tables turned into data frames by `readHTMLTable()`. Assign these two to named dfs before `list()`, change the formatting you need, then assign the named dfs inside `list()` with all else the same: `list(scoring=myscoringdf, game.info = comments[17]...)` (see the sketch below). – Parfait Jul 08 '17 at 13:49
  • lol I didn't ghost ya, just been busy. thanks for all your help! – DataProphets Jul 09 '17 at 08:38
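
A minimal sketch of the adjustment Parfait describes in the comments above, keeping the rest of getwebdata unchanged; the actual reformatting step is left as a placeholder:

# Sketch of the suggested adjustment: build the two readHTMLTable() data frames
# first, apply whatever formatting is needed, then place them in the list by name.
getwebdata <- function(url) {
    print(url)
    boxscoreurl <- paste0("http://www.pro-football-reference.com", url)
    page <- read_html(boxscoreurl)
    comments <- page %>% html_nodes(xpath = "//comment()")

    myscoringdf  <- readHTMLTable(boxscoreurl, which = 2)
    myscoreboxdf <- readHTMLTable(boxscoreurl, which = 1)
    # ...reformat myscoringdf and myscoreboxdf here as needed...

    list(
      scoring  = myscoringdf,
      scorebox = myscoreboxdf,
      game.info = comments[17] %>% html_text() %>% read_html() %>% html_node("#game_info") %>% html_table()
      # ...remaining tables exactly as before
    )
}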