
I've been tearing my hair out playing with variations on this:

'//*[@id="mw-content-text"]/div[2]/table/tbody/tr/td/div/ul/li/a'

as an XPath to get all of the school district URLs from this wiki page: http://en.wikipedia.org/wiki/List_of_school_districts_in_Arkansas . What's the correct XPath?

Thanks in advance!

Code snippet:

            print 3.1, tree.xpath('//*[@id="mw-content-text"]/div[2]')
            print 3.2, tree.xpath('//*[@id="mw-content-text"]/div[2]/table')
            print 3.3, tree.xpath('//*[@id="mw-content-text"]/div[2]/table/tbody')
            print 3.4, tree.xpath('//*[@id="mw-content-text"]/div[2]/table/tbody')     
            print 3.5, tree.xpath('//*[@id="mw-content-text"]/div[2]/table/tbody/tr/td/div/ul/li/a/text()')                           
            for row in tree.xpath('//*[@id="mw-content-text"]/div[2]/table/tbody/tr/td/div/ul/li/a/text()'):
                print row
                district_urls.append('http://en.wikipedia.org'+row.get('href')) 

For reference, this is the output:

3.1 [<Element div at 0x1109f7f00>]
3.2 [<Element table at 0x1109f7f00>]
3.3 []
3.4 []
3.5 []
goldisfine

2 Answers

I guess you created this XPath expression with Firebug or similar developer tools. They work on the browser's DOM, which requires <tbody/> tags around <tr/>s, so these get inserted if they are missing from the source code. If you look at the page source itself (not through Firebug; use wget or curl if necessary), you will see that there are no <tbody/> tags.

Use this expression:

//*[@id="mw-content-text"]/div[2]/table/tr/td/div/ul/li/a
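
To see the difference without hitting Wikipedia, here is a minimal sketch. It assumes the `tree` in the question comes from lxml (the `tree.xpath` call suggests so), and it uses a made-up HTML fragment whose district names and hrefs are placeholders, not the real page content. The fragment has no <tbody/> in its markup, mirroring what the answer says about the page source.

```python
from lxml import html

# Hypothetical stand-in for the page source. As in the raw HTML described
# above, the <tr> sits directly inside <table>, with no <tbody>.
SAMPLE = """
<html><body>
<div id="mw-content-text">
  <div>intro</div>
  <div>
    <table>
      <tr><td><div><ul>
        <li><a href="/wiki/Example_District_A">Example District A</a></li>
        <li><a href="/wiki/Example_District_B">Example District B</a></li>
      </ul></div></td></tr>
    </table>
  </div>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# The expression copied from the browser tools expects a <tbody> step.
with_tbody = tree.xpath(
    '//*[@id="mw-content-text"]/div[2]/table/tbody/tr/td/div/ul/li/a')

# The same expression with the <tbody> step dropped.
without_tbody = tree.xpath(
    '//*[@id="mw-content-text"]/div[2]/table/tr/td/div/ul/li/a')

# Each match is an <a> element, so .get('href') is available.
district_urls = ['http://en.wikipedia.org' + a.get('href')
                 for a in without_tbody]

print(len(with_tbody))   # number of matches when <tbody> is in the path
print(district_urls)
```

On this fragment the first query finds nothing and the second finds both links, because lxml's HTML parser does not add a <tbody/> that is absent from the source.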
Jens Erat

Try this:

//*[@id="mw-content-text"]/div[2]/table/tr/td/div/ul/li/a/text()
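
One caveat if this is dropped into the loop from the question: `/text()` yields strings (the link labels), and strings have no `.get('href')`. A rough sketch of the two options, again assuming lxml and using a made-up fragment with placeholder names: select `/text()` for the labels and `/@href` for the link targets.

```python
from lxml import html

# Hypothetical fragment without <tbody>; names and hrefs are placeholders.
SAMPLE = """
<div id="mw-content-text">
  <div>intro</div>
  <div><table><tr><td><div><ul>
    <li><a href="/wiki/Example_District_A">Example District A</a></li>
    <li><a href="/wiki/Example_District_B">Example District B</a></li>
  </ul></div></td></tr></table></div>
</div>
"""

tree = html.fromstring(SAMPLE)
base = '//*[@id="mw-content-text"]/div[2]/table/tr/td/div/ul/li/a'

# text() returns the link labels as strings.
names = tree.xpath(base + '/text()')

# @href returns the attribute values, which is what the URL list needs.
hrefs = tree.xpath(base + '/@href')
district_urls = ['http://en.wikipedia.org' + h for h in hrefs]

print(names)
print(district_urls)
```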
Gilles Quénot