
I want to extract a table from the web page http://en.wikipedia.org/wiki/Brazil_national_football_team

library(XML)
baseURL <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
xmltext <- htmlParse(baseURL)
xmltable <- xpathApply(xmltext, "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]") 

The XPath I am using is "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]".

Neither

xmltable <- xpathApply(xmltext, "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]")

nor

xmltable <- xpathApply(xmltext, "//table[//tbody//tr//th//a[@title='CONCACAF Gold Cup']]")

returns the table I want. How should I write the XPath expression?
Please see the attachment.

Maurício Linhares
Dd Pp
  • Possible duplicate of http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package – dickoa Sep 02 '12 at 12:18
  • Yes, that is a good example, but I want to understand this properly, and my approach is not the same as that one. – Dd Pp Sep 02 '12 at 13:23

2 Answers


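You can use .. to step up to the parent element in your XPath: //table[@class='wikitable']//th//a[@title='CONCACAF Gold Cup']/../../..

The three /.. steps climb from the a element to its th, then to the tr, and finally to the table.

To convert the table node into a data frame you could use XML::readHTMLTable: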

library(XML)
baseURL <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
xmltext <- htmlParse(baseURL)

## select the table that contains the "CONCACAF Gold Cup" link
tableNode <- xpathApply(xmltext, "//table[@class='wikitable']//th//a[@title='CONCACAF Gold Cup']/../../..")[[1]]

## convert the XMLNode into a data.frame
concacafTable <- readHTMLTable(tableNode, header=FALSE, stringsAsFactors=FALSE)

## format the table: drop the useless "Gold Cup" header (row 1) and use row 2 as the column names
colnames(concacafTable) <- unlist(concacafTable[2, ])
concacafTable <- concacafTable[-c(1,2),]
concacafTable
#   Year       Round GP W D L GF GA
#3  1996  Runners-up  4 3 0 1 10  3
#4  1998 Third Place  5 2 2 1  6  2
#5  2003  Runners-up  5 3 0 2  6  4
#6 Total        3/11 14 8 2 4 22  9
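
As a possible alternative to counting /.. steps, the XPath ancestor axis can pick the nearest enclosing table regardless of nesting depth. The following is only a sketch: the inline HTML is a hypothetical miniature of the page (the id attributes are invented so the result can be checked), not the real Wikipedia markup.

```r
library(XML)

# Hypothetical stand-in for the page: two wikitables, only one links to the Gold Cup
html <- '<html><body>
<table class="wikitable" id="other">
<tr><th><a title="Copa America">Copa America</a></th></tr>
</table>
<table class="wikitable" id="gold">
<tr><th colspan="2"><a title="CONCACAF Gold Cup">CONCACAF Gold Cup</a> record</th></tr>
<tr><th>Year</th><th>Round</th></tr>
<tr><td>1996</td><td>Runners-up</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Path to the link inside the header cell
link <- "//table[@class='wikitable']//th//a[@title='CONCACAF Gold Cup']"

# Variant 1: climb three parents (a -> th -> tr -> table)
byParent <- getNodeSet(doc, paste0(link, "/../../.."))

# Variant 2: take the nearest table ancestor, whatever the depth
byAncestor <- getNodeSet(doc, paste0(link, "/ancestor::table[1]"))

parentId   <- xmlGetAttr(byParent[[1]], "id")
ancestorId <- xmlGetAttr(byAncestor[[1]], "id")
```

On this toy document both variants should select the same table node. The ancestor form may be more robust if the number of levels between the link and the table changes.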
sgibb

I found two more things worth knowing when parsing this page:

1. The tbody element is not available.

tableNode <- xpathApply(xmltext, "//tbody")

returns nothing. The page shows many tbody elements when it is inspected in a browser, but the parser does not recognize any of them as elements.

2. You can select the table directly, without going through parent elements.

tableNode <- xpathApply(xmltext, "//table[@class='wikitable'][./tr/th/a[@title='CONCACAF Gold Cup']]")

works too.
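
Both observations can be checked offline on a small inline document. This is a minimal sketch: the HTML string is a hypothetical miniature of the page (the id attribute is invented for checking), written without tbody tags, as raw table markup usually is.

```r
library(XML)

# Hypothetical stand-in for the page; note that the markup has no <tbody> tags
html <- '<html><body>
<table class="wikitable" id="gold">
<tr><th colspan="2"><a title="CONCACAF Gold Cup">CONCACAF Gold Cup</a> record</th></tr>
<tr><th>Year</th><th>Round</th></tr>
<tr><td>1996</td><td>Runners-up</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# 1. The parser builds the tree from the markup as written, so no tbody nodes exist
nTbody <- length(getNodeSet(doc, "//tbody"))

# An expression that requires tbody therefore matches nothing
withTbody <- getNodeSet(doc, "//table[.//tbody//tr//th//a[@title='CONCACAF Gold Cup']]")

# 2. A predicate on the table selects it directly, with no parent steps
direct <- getNodeSet(doc, "//table[@class='wikitable'][./tr/th/a[@title='CONCACAF Gold Cup']]")
directId <- xmlGetAttr(direct[[1]], "id")
```

Browsers insert tbody when they build the DOM, which is why it appears in the inspector but not in the tree that htmlParse produces from the page source.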
Dd Pp