How to scrape tables inside a comment tag in html with R?

Question

I am trying to scrape http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used SelectorGadget and found the selector for the table I want to be #advanced. However, rvest wasn't picking it up. Looking at the page source, I noticed that the tables are inside HTML comments (<!-- ... -->).

What is the best way to get the tables from inside the comment tags? Thanks!

Edit: I am trying to pull the 'Advanced' table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none

instead of rvest try base::readLines, paste0 and collapse with "" then stringi::stri_extract_all_regex (pasted_data, "\\<\\! (.*?)\\-\\->") then remove the comment strings and use rvest for the table. Sorry this is so sloppy..saw this while on phone...trying best to help haha. — Carl Boneri, Nov 15 '16 at 17:50
Hey sorry I didn't include that but I'm trying to pull the advanced table. Just added it to my question — David Sung, Nov 15 '16 at 18:16
And I appreciate the help Carl even on your phone! I am looking into that now — David Sung, Nov 15 '16 at 18:17

score 9 · Answer 1 · answered Nov 18 '16 at 02:07

You can use the XPath comment() function to select comment nodes, then reparse their contents as HTML:

library(rvest)

# scrape page
h <- read_html('http://www.basketball-reference.com/teams/CHI/2015.html')

df <- h %>% html_nodes(xpath = '//comment()') %>%    # select comment nodes
    html_text() %>%    # extract comment text
    paste(collapse = '') %>%    # collapse to a single string
    read_html() %>%    # reparse to HTML
    html_node('table#advanced') %>%    # select the desired table
    html_table() %>%    # parse table
    .[colSums(is.na(.)) < nrow(.)]    # get rid of spacer columns

df[, 1:15]
##    Rk           Player Age  G   MP  PER   TS%  3PAr   FTr ORB% DRB% TRB% AST% STL% BLK%
## 1   1        Pau Gasol  34 78 2681 22.7 0.550 0.023 0.317  9.2 27.6 18.6 14.4  0.5  4.0
## 2   2     Jimmy Butler  25 65 2513 21.3 0.583 0.212 0.508  5.1 11.2  8.2 14.4  2.3  1.0
## 3   3      Joakim Noah  29 67 2049 15.3 0.482 0.005 0.407 11.9 22.1 17.1 23.0  1.2  2.6
## 4   4     Aaron Brooks  30 82 1885 14.4 0.534 0.383 0.213  1.9  7.5  4.8 24.2  1.5  0.6
## 5   5    Mike Dunleavy  34 63 1838 11.6 0.573 0.547 0.181  1.7 12.7  7.3  9.7  1.1  0.8
## 6   6       Taj Gibson  29 62 1692 16.1 0.545 0.000 0.364 10.7 14.6 12.7  6.9  1.1  3.2
## 7   7   Nikola Mirotic  23 82 1654 17.9 0.556 0.502 0.455  4.3 21.8 13.3  9.7  1.7  2.4
## 8   8     Kirk Hinrich  34 66 1610  6.8 0.468 0.441 0.131  1.4  6.6  4.1 13.8  1.5  0.6
## 9   9     Derrick Rose  26 51 1530 15.9 0.493 0.325 0.224  2.6  8.7  5.7 30.7  1.2  0.8
## 10 10       Tony Snell  23 72 1412 10.2 0.550 0.531 0.148  2.5 10.9  6.8  6.8  1.2  0.6
## 11 11    E'Twaun Moore  25 56  504 10.3 0.504 0.273 0.144  2.7  7.1  5.0 10.4  2.1  0.9
## 12 12   Doug McDermott  23 36  321  6.1 0.480 0.383 0.140  2.1 12.2  7.3  3.0  0.6  0.2
## 13 13    Nazr Mohammed  37 23  128  8.7 0.431 0.000 0.100  9.6 22.3 16.1  3.6  1.6  2.8
## 14 14 Cameron Bairstow  24 18   64  2.1 0.309 0.000 0.357 10.5  3.3  6.8  2.2  1.6  1.1
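
To try the same idea without hitting the network, here is a minimal sketch on an inline page. The HTML string and the `roster` table are made up for illustration; only the `advanced` id mirrors the real site.

```r
library(rvest)

# Hypothetical page: one visible table, one table hidden inside a comment
page <- '<html><body>
<table id="roster"><tr><th>Player</th></tr><tr><td>Pau Gasol</td></tr></table>
<!--
<table id="advanced"><tr><th>Player</th><th>PER</th></tr>
<tr><td>Pau Gasol</td><td>22.7</td></tr></table>
-->
</body></html>'

h <- read_html(page)

# A plain CSS selector finds nothing, because the table is only comment text
n_direct <- length(html_nodes(h, 'table#advanced'))

# Select comment nodes, reparse their text as HTML, then pick the table
advanced <- h %>%
    html_nodes(xpath = '//comment()') %>%
    html_text() %>%
    paste(collapse = '') %>%
    read_html() %>%
    html_node('table#advanced') %>%
    html_table()
```

Here `n_direct` is 0, while `advanced` holds the one-row table recovered from the comment.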

Carl Boneri · Accepted Answer · 2016-11-15T18:51:03.660

3

Ok..got it.

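library(stringi)
library(knitr)
library(rvest)

# Parse with the XML package directly. Note: calling html_table() on the result
# was written against an older rvest (0.2.0); see the comments for rvest 0.3.2.
any_version_html <- function(x) {
    XML::htmlParse(x)
}

a <- 'http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none'
b <- readLines(a)                # raw page source, one element per line
c <- paste0(b, collapse = "")    # collapse to a single string
# extract every <table ...>...</table> block, including those inside comments
d <- as.character(unlist(stri_extract_all_regex(c, '<table(.*?)/table>', omit_no_match = TRUE, simplify = TRUE)))

e <- html_table(any_version_html(d))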


> kable(summary(e),'rst')
======  ==========  ====
Length  Class       Mode
======  ==========  ====
9       data.frame  list
2       data.frame  list
24      data.frame  list
21      data.frame  list
28      data.frame  list
28      data.frame  list
27      data.frame  list
30      data.frame  list
27      data.frame  list
27      data.frame  list
28      data.frame  list
28      data.frame  list
27      data.frame  list
30      data.frame  list
27      data.frame  list
27      data.frame  list
3       data.frame  list
======  ==========  ====


kable(e[[1]],'rst')


===  ================  ===  ====  ===  ==================  ===  ===  =================================
No.  Player            Pos  Ht     Wt  Birth Date               Exp  College                          
===  ================  ===  ====  ===  ==================  ===  ===  =================================
 41  Cameron Bairstow  PF   6-9   250  December 7, 1990    au   R    University of New Mexico         
  0  Aaron Brooks      PG   6-0   161  January 14, 1985    us   6    University of Oregon             
 21  Jimmy Butler      SG   6-7   220  September 14, 1989  us   3    Marquette University             
 34  Mike Dunleavy     SF   6-9   230  September 15, 1980  us   12   Duke University                  
 16  Pau Gasol         PF   7-0   250  July 6, 1980        es   13                                    
 22  Taj Gibson        PF   6-9   225  June 24, 1985       us   5    University of Southern California
 12  Kirk Hinrich      SG   6-4   190  January 2, 1981     us   11   University of Kansas             
  3  Doug McDermott    SF   6-8   225  January 3, 1992     us   R    Creighton University    


## We should index the list with names. This is somewhat cheating, since it relies on knowing the first and last table titles; I prefer to parse in the dark.

# Names are in h2 tags
e_names <- as.character(unlist(stri_extract_all_regex(c, '<h2(.*?)/h2>', simplify = TRUE)))
# keep the titles from 'Roster' through 'Salaries' and strip the markup
e_names <- gsub("<(.*?)>", "", e_names[grep('Roster', e_names):grep('Salaries', e_names)])
names(e) <- e_names
kable(head(e$Salaries), 'rst')

===  ==============  ===========
 Rk  Player          Salary     
===  ==============  ===========
  1  Derrick Rose    $18,862,875
  2  Carlos Boozer   $13,550,000
  3  Joakim Noah     $12,200,000
  4  Taj Gibson      $8,000,000 
  5  Pau Gasol       $7,128,000 
  6  Nikola Mirotic  $5,305,000 
===  ==============  ===========
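
Commenters below report that, with rvest 0.3.2, the `html_table()` call above returned a single data frame rather than 17. A minimal sketch of the workaround Pierre L gives in the comments is to parse each extracted fragment on its own. The two table strings here are hypothetical stand-ins for the real contents of `d`.

```r
library(rvest)

# Hypothetical stand-in for `d`: one <table> string per element
d <- c(
    '<table><tr><th>Rk</th><th>Player</th></tr><tr><td>1</td><td>Derrick Rose</td></tr></table>',
    '<table><tr><th>Player</th><th>Pos</th></tr><tr><td>Pau Gasol</td><td>PF</td></tr></table>'
)

# Parse every fragment separately and keep the single table each one contains
e <- lapply(d, function(.d) html_table(read_html(.d))[[1]])

length(e)
```

Each element of `e` is then one table, so the list has the same length as `d`.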

edited Nov 15 '16 at 18:51

answered Nov 15 '16 at 18:16

Hi Carl. That is not the table that I want. The table that I want is in the ' – David Sung Nov 15 '16 at 18:23
@DavidSung I noticed that as soon as I hit enter. Sorry about that. Above should get you where you want to be, though! Let me know if any issues. Uses package:stringi – Carl Boneri Nov 15 '16 at 18:25
I'm getting `length(e) [1] 1` – Pierre L Nov 15 '16 at 18:39
Hi Carl, I really appreciate all the help. I am still having some difficulty on my end. When I run your code, I am still only getting the one dataframe instead of the 17 that you got. Again, I am just copying and pasting your code but getting different results. – David Sung Nov 15 '16 at 18:43
@PierreLafortune the advanced::none was from the post above. As for the table regex: in plain terms, it says "find all strings that begin with `<table` and end with `/table>`". In our case we found 17 occurrences, i.e. 17 tables. Unlist those and turn them into a character vector. Then we can parse that back into HTML without the comment tags and use the html_table function. Basically, we use the HTML structure to tell our search pattern what we want. – Carl Boneri Nov 15 '16 at 18:44
I am getting one dataframe from the output – Pierre L Nov 15 '16 at 18:45
What's the length of 'd' for you both? – Carl Boneri Nov 15 '16 at 18:46
`length(d) [1] 17` – Pierre L Nov 15 '16 at 18:47
so the problem is the ''read_html'' . I use an older version of rvest from source.. which is actually called with `html` and i substituted read_html for ease of use..which I guess backfired! – Carl Boneri Nov 15 '16 at 18:48
This should fix it `XML::htmlParse(d)` – Carl Boneri Nov 15 '16 at 18:49
Didn't work. This did `e <- lapply(d, function(.d) html_table(read_html(.d))[[1]])` – Pierre L Nov 15 '16 at 18:52
Strange..hmmm. Worked on my Linux and Windows Rstudio's. What version of stringi are you running? – Carl Boneri Nov 15 '16 at 18:53
`rvest_0.3.2 xml2_1.0.0 knitr_1.14 stringi_1.1.2` – Pierre L Nov 15 '16 at 18:57
When I used your code and pasted the string into d, I used: `write(d, file = "data.html")` and then used `html_table(read_html('data.html'))` I get the 17 tables you got. Not sure why that works though and not your method – David Sung Nov 15 '16 at 18:58
A million ways to get to a scraping answer in my experience. I'm sure there's a much more elegant way to skin this cat. And @PierreLafortune `rvest 0.2.0 stringi 1.1.2 xml2 1.0.0` – Carl Boneri Nov 15 '16 at 19:03
Yeap, I really appreciate the time you took to help. Thank you, Carl – David Sung Nov 15 '16 at 19:07
No problem; sorry for the hiccup. I prefer the older version of rvest, so I should have included a disclaimer! Have a great day, gentlemen. – Carl Boneri Nov 15 '16 at 19:08
[Using regex to parse HTML is discouraged.](http://stackoverflow.com/a/1732454/4497050) – alistaire Nov 18 '16 at 00:47
@alistaire understood and I have read up on the topic before; what would be another approach to this problem? – Carl Boneri Nov 18 '16 at 01:03
@CarlBoneri See below for my current approach. Even without XPath's `comment()` function, though, the simpler approach would be to `gsub` out comment tags (opening and closing, but not contents) and reparse as HTML. – alistaire Nov 18 '16 at 02:11
so it's okay to use a regex if removing, but not parsing? didn't know about the comment selector; thanks for the example! – Carl Boneri Nov 18 '16 at 02:13
The less the better, really. The more regex you use on HTML, the bigger the chance you'll introduce a bug you won't see because you're not inspecting the thousands of lines of HTML by hand, and you don't get error messages if you parse it wrong. – alistaire Nov 18 '16 at 02:41
@alistaire genuinely appreciate the example. I use node selectors in most of my functions but really need to study up on xpath from what I can tell. Thanks again – Carl Boneri Nov 18 '16 at 02:43
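
The `gsub` alternative alistaire mentions above (strip only the comment delimiters, then reparse) might look like the sketch below. The inline page is hypothetical; for the real site the string would come from something like `paste(readLines(url), collapse = '\n')`. Note that this exposes every comment on the page, not just the ones holding tables, which is an assumption worth checking per site.

```r
library(rvest)

# Hypothetical page source with the wanted table wrapped in a comment
page <- '<html><body>
<!--
<table id="advanced"><tr><th>Player</th><th>PER</th></tr>
<tr><td>Jimmy Butler</td><td>21.3</td></tr></table>
-->
</body></html>'

# Remove the opening and closing comment delimiters, but keep their contents
uncommented <- gsub('<!--|-->', '', page)

# The table is now ordinary markup and can be selected directly
advanced <- read_html(uncommented) %>%
    html_node('table#advanced') %>%
    html_table()
```

The regex only touches the delimiters themselves, so the HTML parser still does all of the actual table parsing.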
