
I would like to import data into R from a table like this:

http://www.rout.gr/index.php?name=Rout&file=results&year=2011

I tried using the XML package, as suggested in the thread below, but I couldn't get anything back.

Scraping html tables into R data frames using the XML package

ThanasisN

2 Answers


That site does seem to do a few odd things. It appears to return no data unless you send a browser-like user agent. Even then, `readHTMLTable` doesn't behave well: it returns an error if you pass it the whole document. Reading the page source shows that the relevant table has the id `table_results_r_1`; isolating that node and passing it to `readHTMLTable` works:

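library(XML)
library(httr)

theurl <- "http://www.rout.gr/index.php?name=Rout&file=results&year=2011"
# The site returns no data without a browser-like user agent
resp <- GET(theurl, user_agent("Mozilla"))
doc <- htmlParse(content(resp, as = "text"), asText = TRUE)
# Select only the results table by its id, then convert it to a data frame
results <- getNodeSet(doc, "//table[@id='table_results_r_1']")
results <- readHTMLTable(results[[1]])
rm(doc)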

Now you'll need to tidy up the table column names.
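
One way to tidy the names is sketched below. It is only an illustration: the `results` data frame here is a made-up stand-in (the real column names on the page will differ), and `tidy_names` is a hypothetical helper, not part of the XML package.

```r
# Hypothetical stand-in for the data frame returned by readHTMLTable();
# these column names are invented for illustration.
results <- data.frame(
  "Pos. " = c("1", "2"),
  " Athlete Name" = c("A", "B"),
  "Time (h:m:s)" = c("10:01:02", "10:05:40"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: normalise scraped column names with base R only
tidy_names <- function(x) {
  x <- trimws(x)                        # drop leading/trailing whitespace
  x <- tolower(x)                       # use a single case throughout
  x <- gsub("[^[:alnum:]]+", "_", x)    # replace runs of punctuation/space with "_"
  x <- gsub("^_+|_+$", "", x)           # strip underscores left at either end
  make.unique(x, sep = "_")             # disambiguate duplicated names
}

names(results) <- tidy_names(names(results))
```

Using `[[:alnum:]]` rather than `a-z` should keep non-Latin letters (such as Greek headers) intact in a UTF-8 locale, though that depends on the locale R is running in.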

seancarmody
  • If anyone knows why an error results from trying `readHTMLTable` directly on `doc` here, I'd be interested to understand it! – seancarmody Aug 11 '12 at 06:23

Further to my comments, stripping the HTML comments from the document first lets `readHTMLTable` run on the whole page:

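library(XML)
library(httr)
theurl <- "http://www.rout.gr/index.php?name=Rout&file=results&year=2011"
doc <- htmlParse(content(GET(theurl, user_agent("Mozilla")), as = "text"), asText = TRUE)
# Drop every HTML comment node before reading the tables
removeNodes(getNodeSet(doc, "//comment()"))
dum.tables <- readHTMLTable(doc)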

So the comments among the headers of the 14th table were causing the problem. If we remove all HTML comments first, `readHTMLTable` works on every table on the page.
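
The same comment-stripping step can be tried offline on a small inline document. This is a minimal sketch: the HTML string below is invented for illustration and is not taken from the site, so it is not claimed to reproduce the original error, only to show the removal step.

```r
library(XML)

# Made-up HTML with comments among the table headers (illustrative only)
html <- paste0(
  "<html><body><table id='demo'>",
  "<tr><th>Pos</th><!-- hidden column --><th>Name<!-- note --></th></tr>",
  "<tr><td>1</td><td>A</td></tr>",
  "</table></body></html>"
)

doc <- htmlParse(html, asText = TRUE)

# Remove every comment node from the parsed tree
removeNodes(getNodeSet(doc, "//comment()"))

# No comment nodes should remain after the removal
remaining <- length(getNodeSet(doc, "//comment()"))

# Read all tables from the cleaned document
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
```

Removing the nodes modifies `doc` in place, so there is no need to reassign it before calling `readHTMLTable`.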

shhhhimhuntingrabbits