import lxml.html as LH
import urllib2
url = 'http://espn.go.com/nba/team/roster/_/name/bos/boston-celtics'
doc = LH.parse(urllib2.urlopen(url))
# [1:] skips the NAME column-header link matched in the first row
print(doc.xpath('''
//div[@id="my-players-table"]/div//table[1]//tr/td[2]/a/text()''')[1:])
yields
['Brandon Bass', 'Avery Bradley', 'Jae Crowder', 'Jeff Green', 'Jameer Nelson',
'Kelly Olynyk', 'Phil Pressey', 'Marcus Smart', 'Jared Sullinger', 'Marcus
Thornton', 'Evan Turner', 'Gerald Wallace', 'Brandan Wright', 'James Young',
'Tyler Zeller']
When scraping a page, the first thing to do is inspect the raw HTML actually received using urllib or requests:
import urllib2
url = 'http://espn.go.com/nba/team/roster/_/name/bos/boston-celtics'
response = urllib2.urlopen(url)
with open('/tmp/test.html', 'wb') as f:
    f.write(response.read())
Sometimes the HTML looks different from what you see in a GUI browser because
urllib and requests do not execute JavaScript. In that case other tools, such
as selenium, may be needed. Here, however, a text search for "Brandon Bass"
shows the data is present in the HTML downloaded with urllib2:
<td class="sortcell"><a href="http://espn.go.com/nba/player/_/id/2745/brandon-bass">Brandon Bass</a></td>
Using the XPath you posted as a starting point,
you can then use an interactive Python session to find the right XPath:
In [80]: import lxml.html as LH
In [81]: import urllib2
In [82]: url = 'http://espn.go.com/nba/team/roster/_/name/bos/boston-celtics'
In [83]: doc = LH.parse(urllib2.urlopen(url))
In [84]: [LH.tostring(elt) for elt in doc.xpath('//div[@id="my-players-table"]/div//table/tr')]
Out[84]:
['<tr class="stathead"><td colspan="8">Team Roster</td></tr>',
'<tr class="colhead"><td><a href="http://espn.go.com/nba/team/roster/_/name/bos/sort/jersey/order/false/boston-celtics">NO.</a></td><td><a href="http://espn.go.com/nba/team/roster/_/name/bos/order/false/boston-celtics">NAME</a></td><td>POS</td><td><a href="http://espn.go.com/nba/team/roster/_/name/bos/sort/age/order/false/boston-celtics">AGE</a></td><td><a href="http://espn.go.com/nba/team/roster/_/name/bos/sort/height/order/false/boston-celtics">HT</a></td><td><a href="http://espn.go.com/nba/team/roster/_/name/bos/sort/weight/order/false/boston-celtics">WT</a></td><td>COLLEGE</td><td>2014-2015 SALARY</td></tr>',
In [86]: [elt.text_content() for elt in doc.xpath('//div[@id="my-players-table"]/div//table/tr/td')]
which led to the final expression:
//div[@id="my-players-table"]/div//table[1]//tr/td[2]/a/text()
(Above, I used LH.tostring to serialize HTML snippets for inspection, and
elt.text_content() to inspect the text contained in various elements.)
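Here is a minimal sketch of those two inspection helpers in isolation, run against a toy snippet rather than the live ESPN page (the snippet below is made up for illustration):

```python
import lxml.html as LH

# Toy HTML fragment standing in for one roster row
snippet = '<table><tr><td><a href="#">Brandon Bass</a></td><td>PF</td></tr></table>'
doc = LH.fromstring(snippet)
row = doc.xpath('//tr')[0]

# LH.tostring serializes an element back to HTML so you can eyeball its structure
print(LH.tostring(row))

# text_content() returns all text inside an element, with the tags stripped
print(row.text_content())   # Brandon BassPF
```

Note that lxml's HTML parser does not insert a `tbody` element the way GUI browsers do, which is why `table/tr` (with no `tbody`) works in the XPaths above.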
This is the first tutorial I read to understand XPath.
Once you get the basics under your belt, you can start reading the XPath 1.0
specification. There are also XPath 2.0 and XPath 3.0 specifications, but
current lxml supports only XPath 1.0.
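For instance, two XPath 1.0 staples worth learning early are the contains() function and 1-based positional predicates. A quick demonstration with lxml on made-up HTML (the class names here are invented, not from the ESPN page):

```python
import lxml.html as LH

doc = LH.fromstring('''
<ul>
  <li class="player starter">Avery Bradley</li>
  <li class="player">Phil Pressey</li>
  <li class="player starter">Marcus Smart</li>
</ul>''')

# contains() matches a substring of an attribute value (XPath 1.0)
starters = doc.xpath('//li[contains(@class, "starter")]/text()')
print(starters)   # ['Avery Bradley', 'Marcus Smart']

# positional predicates count from 1, not 0
second = doc.xpath('//li[2]/text()')
print(second)     # ['Phil Pressey']
```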
Concurrently you can read the lxml docs, assuming you are using lxml.
I also found reading Stackoverflow XPath questions, such as this one, helpful.
Each time I encounter a new function or technique, I write a bit of
demonstration code -- a minimal example -- showing (myself) how it works.
That way, whenever I need to do XYZ again I can start from some runnable code.
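In that spirit, here is a minimal, self-contained version of the td[2]/a/text() pattern used above. The table is a made-up stand-in for the ESPN roster page, so it can run without a network connection:

```python
import lxml.html as LH

# Toy stand-in for the roster page: player names live in an <a>
# inside the 2nd <td> of each data row
html = '''
<div id="my-players-table"><div><table>
  <tr><td colspan="8">Team Roster</td></tr>
  <tr><td>30</td><td><a href="#">Brandon Bass</a></td><td>PF</td></tr>
  <tr><td>0</td><td><a href="#">Avery Bradley</a></td><td>SG</td></tr>
</table></div></div>'''

doc = LH.fromstring(html)
names = doc.xpath('//div[@id="my-players-table"]/div//table[1]//tr/td[2]/a/text()')
print(names)   # ['Brandon Bass', 'Avery Bradley']
```

Unlike the live page, this toy header row has no link in its second cell, so no [1:] slice is needed here.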