Extracting URLs from page?

Question

I've been tearing my hair out playing with variations on this:

'//*[@id="mw-content-text"]/div[2]/table/tbody/tr/td/div/ul/li/a'

as an XPath to get all of the school district URLs from this wiki page: http://en.wikipedia.org/wiki/List_of_school_districts_in_Arkansas . What's the correct XPath?

Thanks in advance!

Code snippet:

            print 3.1, tree.xpath('//*[@id="mw-content-text"]/div[2]')
            print 3.2, tree.xpath('//*[@id="mw-content-text"]/div[2]/table')
            print 3.3, tree.xpath('//*[@id="mw-content-text"]/div[2]/table/tbody')
            print 3.4, tree.xpath('//*[@id="mw-content-text"]/div[2]/table/tbody')
            print 3.5, tree.xpath('//*[@id="mw-content-text"]/div[2]/table/tbody/tr/td/div/ul/li/a/text()')
            for row in tree.xpath('//*[@id="mw-content-text"]/div[2]/table/tbody/tr/td/div/ul/li/a/text()'):
                print row
                district_urls.append('http://en.wikipedia.org'+row.get('href'))

As a reference:

3.1 [<Element div at 0x1109f7f00>]
3.2 [<Element table at 0x1109f7f00>]
3.3 []
3.4 []
3.5 []

score 2 · Accepted Answer · answered Aug 14 '13 at 18:46

I guess you created this XPath expression using Firebug or similar developer tools. They work on the DOM, which requires <tbody/> tags around <tr/>s, so these get inserted if they are not given in the source code. When you look at the page source (not through Firebug; use wget or curl if necessary), you will see that there are no <tbody/> tags.

Use this expression:

//*[@id="mw-content-text"]/div[2]/table/tr/td/div/ul/li/a
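
To see the difference without a browser in the way, here is a minimal sketch. It assumes a small hand-written fragment (SAMPLE, with placeholder district names, not the real Wikipedia markup) and uses the standard library's xml.etree.ElementTree as a stand-in for lxml. Like lxml, it parses the source as written and does not add <tbody/>, so the same two expressions should behave the same way with tree.xpath on the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page source. Note that <table>
# contains <tr> directly: there is no <tbody> in the raw markup.
SAMPLE = """
<html><body>
<div id="mw-content-text">
  <div>intro</div>
  <div>
    <table>
      <tr><td><div><ul>
        <li><a href="/wiki/Example_District_A">Example District A</a></li>
        <li><a href="/wiki/Example_District_B">Example District B</a></li>
      </ul></div></td></tr>
    </table>
  </div>
</div>
</body></html>
"""

root = ET.fromstring(SAMPLE)
prefix = './/*[@id="mw-content-text"]/div[2]/table'

# Path copied from the browser's DOM view: includes a tbody step.
with_tbody = root.findall(prefix + '/tbody/tr/td/div/ul/li/a')

# Path matching the raw source: no tbody step.
without_tbody = root.findall(prefix + '/tr/td/div/ul/li/a')

print(len(with_tbody), len(without_tbody))
```

The first lookup comes back empty because no tbody element exists in the parsed tree; the second finds both links.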

Gilles Quénot · Answer 2 · 2013-08-14T18:51:14.610

0

Try this:

//*[@id="mw-content-text"]/div[2]/table/tr/td/div/ul/li/a/text()
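
Note that an expression ending in /text() returns the link labels as plain strings, and strings have no href to read, so row.get('href') in the question's loop would fail even once the path matches. To build URLs, select the <a/> elements themselves. A minimal sketch, assuming a hypothetical fragment with placeholder names and using the standard library's ElementTree in place of lxml (ElementTree has no text() step, so .text is used for the label):

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE_URL = "http://en.wikipedia.org"

# Hypothetical fragment standing in for one list of district links.
FRAGMENT = """
<ul>
  <li><a href="/wiki/Example_District_A">Example District A</a></li>
  <li><a href="/wiki/Example_District_B">Example District B</a></li>
</ul>
"""

# Select the <a> elements, not their text, so both label and href are available.
links = ET.fromstring(FRAGMENT).findall("./li/a")

labels = [a.text for a in links]
district_urls = [urljoin(BASE_URL, a.get("href")) for a in links]

for label, url in zip(labels, district_urls):
    print(label, url)
```

With lxml the same idea would be to drop the trailing /text() from the expression and call .get('href') on each returned element.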

edited Aug 14 '13 at 18:51

answered Aug 14 '13 at 18:24

Gilles Quénot


Did not work. I posted some output and you can see that beyond the table object, I cannot access any of the content. – goldisfine Aug 14 '13 at 18:37
Post updated with the help of Jens Erat's reminder about firebug. – Gilles Quénot Aug 14 '13 at 18:51
