Scraping basketball-reference.com in R (XML package not fully working)

Question

I have been scraping various pages of basketball-reference.com in R for a while now, using readHTMLTable() from the XML package, without any issues. Now I have one: when I try to scrape the splits section of a player's page, it returns only the first row of the table rather than all of them.

For example:

library(XML)
URL <- "http://www.basketball-reference.com/players/j/jamesle01/splits/"
tablefromURL <- readHTMLTable(URL)
table <- tablefromURL[[1]]

This gives me only one row of the table, the first one, but I want all the rows. I think the problem is that the table contains multiple headers, but I'm not sure how to fix that.

Thanks

I can reproduce the issue, and it has to do with the splits. I would suggest using the `rvest` package for the scraping instead. This how-to guide by Alex Bresler on getting data from basketball-reference.com with `rvest` should be useful: http://asbcllc.com/blog/2014/november/creating_bref_scraper/ – jalapic, Jan 08 '15 at 02:37

MrFlick · Answer 1 · 2015-01-08T04:33:58.433

2

Why not try the rvest package? You can accomplish this with:

library(rvest)
dd <- html_session("http://www.basketball-reference.com/players/j/jamesle01/splits/") %>%
    html_node("table#stats") %>%
    html_table()

It's still a bit messy with the headers mixed in the data, but it does extract the entire table.

Tested with

R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

other attached packages:
[1] rvest_0.2.0

loaded via a namespace (and not attached):
[1] httr_0.6.1    magrittr_1.5  stringr_0.6.2

edited Jan 08 '15 at 04:33

answered Jan 08 '15 at 02:40

MrFlick


I'm getting this error... Error in FUN(X[[1L]], ...) : unused argument (trim = TRUE) – hchw Jan 08 '15 at 04:03
Worked well for me: `str(dd) 'data.frame': 82 obs. of 30 variables:`. Still will need to extract the first row as column names and convert the rest to numeric. – IRTFM Jan 08 '15 at 04:07
@hchw Which version of `rvest` are you using? Did you copy/paste code exactly? – MrFlick Jan 08 '15 at 04:08
@BondedDust I've actually been trying to work on that for a while now. Do you know of an easy way to filter out nodes using either rvest or XML? I'd like to delete the redundant ``. – MrFlick Jan 08 '15 at 04:09
@MrFlick I copied and pasted the code exactly. I downloaded rvest_0.2.0.tar.gz and then loaded it as a source file – hchw Jan 08 '15 at 04:11
@hchw. Is there a reason you didn't use `install.packages("rvest")`? I'm worried you may have not updated necessary dependencies. The full `traceback()` would be more helpful to determine where such an error would be coming from. – MrFlick Jan 08 '15 at 04:12
@MrFlick I couldn't install it that way because it gave me this: `Warning in install.packages : package ‘rvest’ is not available (for R version 3.0.1)`. I'm using RStudio; I don't know if that makes a difference. – hchw Jan 08 '15 at 04:16
@hchw Well, that is an older version of R but [CRAN says](http://cran.r-project.org/web/packages/rvest/index.html) that's the minimum. Maybe try looking at this question: http://stackoverflow.com/questions/25721884/how-should-i-deal-with-package-xxx-is-not-available-warning. – MrFlick Jan 08 '15 at 04:18
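
The cleanup mentioned in the comments above (promoting the first row to column names and converting the remaining columns to numeric) might be sketched as follows. This is only an illustration in base R, not part of the original answer. The small `dd` data frame is a made-up stand-in for the scraped table, `clean_splits` is a hypothetical helper name, and the sketch assumes that the header row repeats verbatim inside the data and that the first two columns hold text labels.

```r
# Hypothetical stand-in for the scraped data frame: row 1 holds the real
# column names and the same header row shows up again further down.
dd <- data.frame(
  X1 = c("Split", "", "Place", "Split", "All-Star"),
  X2 = c("Value", "Total", "Home", "Value", "Pre"),
  X3 = c("G", "871", "441", "G", "569"),
  X4 = c("FG%", ".496", ".506", "FG%", ".496"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: use the first row as column names, drop every row that
# repeats it, and convert the non-label columns to numeric.
clean_splits <- function(df, label_cols = 1:2) {
  header <- as.character(unlist(df[1, ]))
  # A row counts as a header row if its first two cells match the header.
  is_header <- df[[1]] %in% header[1] & df[[2]] %in% header[2]
  out <- df[!is_header, , drop = FALSE]
  names(out) <- header
  value_cols <- setdiff(seq_along(out), label_cols)
  # as.character() first, so factor columns convert by label, not by code.
  out[value_cols] <- lapply(out[value_cols],
                            function(x) as.numeric(as.character(x)))
  rownames(out) <- NULL
  out
}

cleaned <- clean_splits(dd)
str(cleaned)
```

On the real table the set of label columns may differ, so `label_cols` would need checking against the actual output of `html_table()`.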

jdharrison · Answer 2 · 2015-01-08T07:21:57.517

You can filter on the table bodies:

library(XML)
appURL <- "http://www.basketball-reference.com/players/j/jamesle01/splits/"
doc <- htmlParse(appURL)
appTables <- doc['//table/tbody']

appTables would be a list containing the 12 tables sans headers. To retrieve the headers you can get them from the thead:

myHeaders <- unlist(doc["//thead/tr[2]/th", fun = xmlValue])
myTables <- lapply(appTables, readHTMLTable, header = myHeaders)

You can put the data in one big table using something like:

bigTable <- do.call(rbind, myTables)
> head(bigTable)
     Split Value   G  GS    MP   FG   FGA   3P  3PA   FT  FTA  ORB  TRB  AST  STL BLK  TOV   PF   PTS  FG%  3P%  FT%
1          Total 871 870 34364 8582 17289 1184 3462 5553 7432 1049 6239 6011 1483 698 2906 1615 23901 .496 .342 .747
2    Place  Home 441 440 17167 4201  8307  567 1627 2805 3706  507 3133 3082  711 387 1413  744 11774 .506 .348 .757
3           Road 430 430 17197 4381  8982  617 1835 2748 3726  542 3106 2929  772 311 1493  871 12127 .488 .336 .738
4 All-Star   Pre 569 568 22349 5544 11167  759 2205 3576 4791  655 4051 3966  967 456 1940 1087 15423 .496 .344 .746
5           Post 302 302 12015 3038  6122  425 1257 1977 2641  394 2188 2045  516 242  966  528  8478 .496 .338 .749
6   Result   Win 572 571 22196 5783 11094  772 2154 3749 4931  677 4241 4132 1032 496 1793 1016 16087 .521 .358 .760
   TS% USG% ORtg DRtg   MP  PTS TRB AST
1 .581 31.9  116  103 39.5 27.4 7.2 6.9
2 .592 30.9  118  102 38.9 26.7 7.1 7.0
3 .571 32.8  114  105 40.0 28.2 7.2 6.8
4 .581 31.7  116  103 39.3 27.1 7.1 7.0
5 .582 32.2  117  104 39.8 28.1 7.2 6.8
6 .606 31.7  122   99 38.8 28.1 7.4 7.2
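
To illustrate the combining step in isolation, here is a minimal base-R sketch of `do.call(rbind, ...)` on a list of data frames. The two small data frames are hypothetical stand-ins for entries of `myTables`; they reuse a few values from the output above and keep only three columns. `rbind()` on data frames expects every piece to carry the same column names, which is presumably why the shared header vector is applied to each table first.

```r
# Hypothetical stand-ins for two of the per-split tables, cut down to three
# columns; the figures are copied from the output shown above.
place <- data.frame(
  Split = c("Place", ""),
  Value = c("Home", "Road"),
  G     = c(441, 430),
  stringsAsFactors = FALSE
)
all_star <- data.frame(
  Split = c("All-Star", ""),
  Value = c("Pre", "Post"),
  G     = c(569, 302),
  stringsAsFactors = FALSE
)

pieces <- list(place, all_star)

# do.call() passes each list element to rbind() as a separate argument,
# stacking the pieces row-wise into a single data frame.
combined <- do.call(rbind, pieces)
print(combined)
```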

crubba · Answer 3 · 2015-01-13T10:33:50.423

0

Have a look at the htmltab package (https://github.com/crubba/htmltab). I developed this package for more complex HTML tables where readHTMLTable() is of little use.

devtools::install_github("crubba/htmltab")
library(htmltab)
htmltab(doc = "http://www.basketball-reference.com/players/j/jamesle01/splits/", header = 1:2)

edited Jan 13 '15 at 10:33

answered Jan 12 '15 at 09:50

crubba

