
I'm trying to read head-to-head (h2h) data from a Tennis Abstract web page in R using the XML package.

I want the big h2h table at the bottom of the page. Its CSS selector is:
html > body > div#main > table#maintable > tbody > tr > td#stats > table#matches.tablesorter

I have tried following the suggestions from "Scraping html into R data frame".
I believe the difficulty is caused by the table being nested inside another table.

url = "http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic&f=ACareerqqs00&view=h2h"
library(RCurl)
library(XML)

webpage <- getURL(url)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)  # doesn't contain the h2h table
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
results <- xpathSApply(pagetree, "//*/table[@class='tablesorter']/tr/td", xmlValue)  # gives NULL

tables <- readHTMLTable(url, stringsAsFactors = TRUE)  # returns 4 tables, none of them the desired one

I'm new to HTML parsing, so please bear with me.

Community
Sujay
    I think the table is built using JavaScript. Use [RSelenium](https://github.com/ropensci/RSelenium) to scrape it. You'll find plenty of examples [here](http://stackoverflow.com/questions/tagged/rselenium). – lukeA Mar 06 '15 at 10:22

1 Answer

This is not the most efficient approach, but it will do the job.

library(rvest)
library(RSelenium)

tennis.url <- "http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic&f=ACareerqqs00&view=h2h"

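# Start a local Selenium server and open a browser session
checkForServer(); startServer()
remDrv <- remoteDriver()
remDrv$open()

# Load the page in the browser so its JavaScript runs, then parse the rendered source
remDrv$navigate(tennis.url)
tennis.html <- read_html(remDrv$getPageSource()[[1]])  # read_html() replaces the deprecated html()

remDrv$close()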

H2Hs <- tennis.html %>% html_nodes(".h2hclick") %>% html_text %>% as.numeric
Opponent <- tennis.html %>% html_nodes("#matches a") %>% html_text
Country <- tennis.html %>% html_nodes("a+ span") %>% html_text %>% gsub("[^(A-Z)]", "", .)
W <- tennis.html %>% html_nodes("#matches td:nth-child(3)") %>% .[-1] %>% html_text %>% as.numeric
L <- tennis.html %>% html_nodes("#matches td:nth-child(4)") %>% .[-1] %>% html_text %>% as.numeric
Win.Prc <- tennis.html %>% html_nodes("#matches td:nth-child(5)") %>% .[-1] %>% html_text

And so on for the remaining columns: increment the index in nth-child(#) for each one, then combine the resulting vectors into a data frame.
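
The final "create a data frame" step might look roughly like the sketch below. It runs on a tiny inline table rather than the live page, so no Selenium session is needed: the `sample.html` markup, its two rows, and the `col_text` helper are hypothetical stand-ins and not taken from the real site. The sample puts its header in `th` cells, so the `.[-1]` used above to drop the first match is not needed here.

```r
library(rvest)

# Hypothetical stand-in for the rendered page source. The real table has more
# rows and columns, and its exact markup is an assumption.
sample.html <- read_html(paste0(
  '<table id="matches">',
  '<tr><th>H2Hs</th><th>Opponent</th><th>W</th><th>L</th><th>Win%</th></tr>',
  '<tr><td>10</td><td><a href="#">Player A</a></td><td>6</td><td>4</td><td>60%</td></tr>',
  '<tr><td>5</td><td><a href="#">Player B</a></td><td>2</td><td>3</td><td>40%</td></tr>',
  '</table>'
))

# Hypothetical helper: text of the n-th cell in every data row of #matches.
col_text <- function(doc, n) {
  doc %>%
    html_nodes(paste0("#matches td:nth-child(", n, ")")) %>%
    html_text()
}

# One vector per column, combined into a data frame.
h2h <- data.frame(
  H2Hs     = as.numeric(col_text(sample.html, 1)),
  Opponent = col_text(sample.html, 2),
  W        = as.numeric(col_text(sample.html, 3)),
  L        = as.numeric(col_text(sample.html, 4)),
  Win.Prc  = col_text(sample.html, 5),
  stringsAsFactors = FALSE
)

print(h2h)
```

If the rendered table turns out to be regular, calling `html_table()` on the `#matches` node may give a similar data frame in a single step, though that is untested against this page.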

hrbrmstr
dimitris_ps
    I'm getting the following error. remDrv$open() gives: [1] "Connecting to remote server" Undefined error in RCurl call. Error in queryRD(paste0(serverURL, "/session"), "POST", qdata = toJSON(serverOpts)) : – Sujay Mar 06 '15 at 11:36
    It started working after I ran the following from the command prompt: java -jar selenium-server-standalone.jar – Sujay Mar 06 '15 at 12:09