
I am trying to extract data from http://www.rsssf.com/tablese/eng2014.html, such as the league standings and the scores for each round, into R.

I know that the XML and RCurl packages can be used for this, but I am not sure how to go about it.

Referring to this: Scraping html tables into R data frames using the XML package

library(XML)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
tables <- readHTMLTable(theurl)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
# the picked table is the longest one on the page

tables[[which.max(n.rows)]]

I still can't get the table from the website. I would really appreciate it if anyone could help me with this. Thanks!

Hann

1 Answer


The reason you are having trouble is that the table on that page is NOT an HTML table, so `readHTMLTable` cannot find it. You can confirm this with View Page Source in your browser. Here is some code to help you get started with extracting the data in the table and putting it into a data frame.

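# Read the raw page as lines of text
dat <- readLines('http://www.rsssf.com/tablese/eng2014.html', warn = FALSE)
# The standings sit between the first "Table" line and the first "Round" line
start <- grep('Table', dat)[1] + 2
end <- grep('Round', dat)[1] - 2
dat2 <- dat[start:end]

# Parse the fixed-width columns, then drop the "---" separator rows
dat3 <- read.fwf(textConnection(dat2), widths = c(3, 24, 3, 3, 3, 3, 8, 3))
dat3[dat3$V1 != "---",]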
Ramnath
  • Thank you so much for assisting me! Another question: do you know how to extract the scores for each round (the data below the league standings table)? – Hann Oct 21 '13 at 06:11
  • You can do it by detecting the start of the data (Round * ...), extracting the text table for each round, and applying `read.fwf` to each table. I would suggest writing a function to do it for a particular round and then looping over the rounds with `lapply`. – Ramnath Oct 21 '13 at 11:32
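
The per-round approach described in the last comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the sample lines, the column widths (16 characters for the home team, 3 for the score, 1 skipped, 20 for the away team), and the helper names `fixture` and `parse_round` are all hypothetical and not taken from the actual page, so the widths and patterns would need adjusting after inspecting the real text.

```r
# Hypothetical helper that builds one fixture line in the ASSUMED layout:
# home team padded to 16 characters, a 3-character score, a space, the away team.
fixture <- function(home, score, away) sprintf("%-16s%s %s", home, score, away)

# Made-up sample text standing in for the lines below the standings table.
sample_lines <- c(
  "Round 1",
  "[Aug 17]",
  fixture("Arsenal", "1-3", "Aston Villa"),
  fixture("Liverpool", "1-0", "Stoke City"),
  "",
  "Round 2",
  "[Aug 24]",
  fixture("Fulham", "1-3", "Arsenal"),
  fixture("Aston Villa", "0-1", "Liverpool")
)

# Parse the i-th round: take the lines between this "Round" header and the
# next one, keep only lines that contain a score, and read them as fixed width.
parse_round <- function(i, lines, headers) {
  first <- headers[i] + 1
  last <- if (i < length(headers)) headers[i + 1] - 1 else length(lines)
  block <- lines[first:last]
  block <- block[grepl("[0-9]+-[0-9]+", block)]  # drops dates and blank lines

  con <- textConnection(block)
  on.exit(close(con))
  res <- read.fwf(con, widths = c(16, 3, -1, 20),  # -1 skips the separator space
                  col.names = c("home", "score", "away"),
                  strip.white = TRUE, stringsAsFactors = FALSE)

  # Take the round number from the header line itself
  res$round <- as.integer(sub("^Round ([0-9]+).*", "\\1", lines[headers[i]]))
  res
}

# Locate every round header, parse each round, and stack the results.
headers <- grep("^Round [0-9]+", sample_lines)
rounds <- lapply(seq_along(headers), parse_round,
                 lines = sample_lines, headers = headers)
scores <- do.call(rbind, rounds)

# Split "1-3" into numeric home and away goals.
goals <- do.call(rbind, strsplit(scores$score, "-"))
scores$home_goals <- as.integer(goals[, 1])
scores$away_goals <- as.integer(goals[, 2])
```

To try this on the real page, `sample_lines` would be replaced by the lines returned from `readLines`, restricted to the part after the standings table.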