I'm new to Python, and to XPath in particular, and I'm trying to scrape a list of strings into a Python list. I understand what I want to do but not how to write it. I'm trying to pull player names from ESPN's team roster page (linked at the bottom of this post).

I think my code would look something like the loop further down, because the data sits in a table and each entry I want has an XPath like this one (copied from Chrome), where I believe the final a points either to the link itself or to the link's text:

//*[@id="my-players-table"]/div[2]/div/table[1]/tbody/tr[3]/td[2]/a

What matters for my problem is that incrementing the tr index (tr[3] above) moves to the next player's name, and the names are the data I ultimately want.

For EachRow in Table:
    If ChildElement exists:
        Add ChildElement to List
    Else:
        go to next row

Now, would I just replace EachRow with //*[@id="my-players-table"]/div[2]/div/table[1]/tbody/tr[i] and ChildElement with //*[@id="my-players-table"]/div[2]/div/table[1]/tbody/tr[i]/td[2]/a ?

Also, can anyone recommend a good blog or tutorial for mastering XPath, ideally XPath used alongside Python? I'm not sure whether the documentation has relevant examples, but if it does, I'll gladly take a look.

Thanks and Merry XMas everyone

BTW, the page I'm trying to dissect: http://espn.go.com/nba/team/roster/_/name/bos/boston-celtics

user3042850

1 Answer

import lxml.html as LH
import urllib2
url = 'http://espn.go.com/nba/team/roster/_/name/bos/boston-celtics'
doc = LH.parse(urllib2.urlopen(url))
print(doc.xpath('''
    //div[@id="my-players-table"]/div//table[1]//tr/td[2]/a/text()''')[1:])

yields

['Brandon Bass', 'Avery Bradley', 'Jae Crowder', 'Jeff Green',
 'Jameer Nelson', 'Kelly Olynyk', 'Phil Pressey', 'Marcus Smart',
 'Jared Sullinger', 'Marcus Thornton', 'Evan Turner', 'Gerald Wallace',
 'Brandan Wright', 'James Young', 'Tyler Zeller']
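
To see why the [1:] slice is there, here is a minimal offline sketch. The HTML fragment is a hypothetical stand-in that I modelled on the rows shown further down (it is not the real page), so treat it as an illustration under that assumption. Note also that the path uses //tr rather than Chrome's tbody/tr: browsers insert tbody into the DOM, while the table/tr query later in this answer matches rows directly in the downloaded HTML.

```python
import lxml.html as LH

# Hypothetical fragment modelled on the roster markup quoted in this answer;
# the real page has more wrapper elements, columns, and rows.
HTML = '''
<div id="my-players-table"><div><div>
<table>
<tr class="stathead"><td colspan="8">Team Roster</td></tr>
<tr class="colhead"><td><a href="#">NO.</a></td><td><a href="#">NAME</a></td><td>POS</td></tr>
<tr><td>30</td><td class="sortcell"><a href="#">Brandon Bass</a></td><td>PF</td></tr>
<tr><td>0</td><td class="sortcell"><a href="#">Avery Bradley</a></td><td>SG</td></tr>
</table>
</div></div></div>
'''

doc = LH.fromstring(HTML)
names = doc.xpath(
    '//div[@id="my-players-table"]/div//table[1]//tr/td[2]/a/text()')

print(names)      # the column-header link "NAME" is matched first
print(names[1:])  # slicing it off leaves only the player names
```

The "Team Roster" row has a single td, so td[2] never matches it; only the column-header row needs to be dropped.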

When scraping a page, the first thing to do is visually inspect the HTML received using urllib or requests:

import urllib2
url = 'http://espn.go.com/nba/team/roster/_/name/bos/boston-celtics'
response = urllib2.urlopen(url)
with open('/tmp/test.html', 'wb') as f:
    f.write(response.read())

Sometimes the HTML differs from what you see in a GUI browser, because urllib and requests do not execute JavaScript. In that case other tools, such as selenium, may be needed. Here, however, a text search for "Brandon Bass" shows the data is present in the HTML downloaded with urllib2:

<td class="sortcell"><a href="http://espn.go.com/nba/player/_/id/2745/brandon-bass">Brandon Bass</a></td>
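
The same check can be scripted. Below is a rough sketch for Python 3, where urllib2 no longer exists and urllib.request takes its place. The helper names fetch and find_context are my own, not part of any library, and the sample bytes are just the row quoted above, so the example runs without touching the network.

```python
import urllib.request


def fetch(url):
    """Return the raw bytes of the page (Python 3 stand-in for urllib2.urlopen)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def find_context(html_bytes, needle, width=40):
    """Return the text around the first occurrence of needle, or None if absent."""
    text = html_bytes.decode('utf-8', errors='replace')
    pos = text.find(needle)
    if pos == -1:
        return None
    return text[max(0, pos - width):pos + len(needle) + width]


# Offline check against the row quoted above; use fetch(url) for a live page.
sample = (b'<td class="sortcell"><a href="http://espn.go.com/nba/player/_/id/'
          b'2745/brandon-bass">Brandon Bass</a></td>')

print(find_context(sample, 'Brandon Bass'))
print(find_context(sample, 'Larry Bird'))  # None: not in the raw HTML
```

If the search comes back empty for text you can see in the browser, that is a hint the content is being filled in by JavaScript.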

Using the XPath you posted as a starting point, you can then use an interactive Python session to find the right XPath:

In [80]: import lxml.html as LH
In [81]: import urllib2
In [82]: url = 'http://espn.go.com/nba/team/roster/_/name/bos/boston-celtics'
In [83]: doc = LH.parse(urllib2.urlopen(url))
In [84]: [LH.tostring(elt) for elt in doc.xpath('//div[@id="my-players-table"]/div//table/tr')]
Out[84]: 
['<tr class="stathead"><td colspan="8">Team Roster</td></tr>',
 '<tr class="colhead"><td><a href="http://espn.go.com/nba/team/roster/_/name/bos/sort/jersey/order/false/boston-celtics">NO.</a></td><td><a href="http://espn.go.com/nba/team/roster/_/name/bos/order/false/boston-celtics">NAME</a></td><td>POS</td><td><a href="http://espn.go.com/nba/team/roster/_/name/bos/sort/age/order/false/boston-celtics">AGE</a></td><td><a href="http://espn.go.com/nba/team/roster/_/name/bos/sort/height/order/false/boston-celtics">HT</a></td><td><a href="http://espn.go.com/nba/team/roster/_/name/bos/sort/weight/order/false/boston-celtics">WT</a></td><td>COLLEGE</td><td>2014-2015 SALARY</td></tr>',
 ...]
In [86]: [elt.text_content() for elt in doc.xpath('//div[@id="my-players-table"]/div//table/tr/td')]

which led to

//div[@id="my-players-table"]/div//table[1]//tr/td[2]/a/text()

(Above, I made use of the LH.tostring function to inspect HTML snippets, and elt.text_content() to inspect the text contained in various elements.)
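
If you prefer the row-by-row loop from your pseudocode, it might look roughly like this. It is a sketch, not the only way to do it: the fragment is a hypothetical stand-in for the real page, and skipping rows by the stathead and colhead classes relies on the class names seen in the output above.

```python
import lxml.html as LH

# Hypothetical fragment modelled on the rows shown above (not the real page).
HTML = '''
<div id="my-players-table"><div><div>
<table>
<tr class="stathead"><td colspan="8">Team Roster</td></tr>
<tr class="colhead"><td><a href="#">NO.</a></td><td><a href="#">NAME</a></td><td>POS</td></tr>
<tr><td>30</td><td class="sortcell"><a href="#">Brandon Bass</a></td><td>PF</td></tr>
<tr><td>0</td><td class="sortcell"><a href="#">Avery Bradley</a></td><td>SG</td></tr>
</table>
</div></div></div>
'''

doc = LH.fromstring(HTML)
players = []

# "For EachRow in Table"
for row in doc.xpath('//div[@id="my-players-table"]/div//table[1]//tr'):
    # Skip the title and column-header rows.
    if row.get('class') in ('stathead', 'colhead'):
        continue
    # "If ChildElement exists": the leading ./ makes the path relative to this row.
    links = row.xpath('./td[2]/a')
    if links:
        players.append(links[0].text_content())

print(players)
```

The important detail is the relative path: an XPath starting with // searches the whole document again, even when called on a row element.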


This is the first tutorial I read to understand XPath.

Once you get the basics under your belt, you can start reading the XPath v1.0 specification. There is also an XPath v2 and XPath v3, but current lxml only supports XPath 1.0.
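
As a small illustration of what the 1.0 limit means in practice, here is a sketch using a made-up list fragment. XPath 2.0 added lower-case(), which lxml rejects, so case-insensitive matching is usually done with the XPath 1.0 translate() function instead. The variable names u and l are arbitrary; lxml accepts XPath variables as keyword arguments.

```python
from lxml import etree
import lxml.html as LH

# Made-up fragment, only for demonstrating the two function calls.
doc = LH.fromstring('<ul><li>Brandon Bass</li><li>Avery Bradley</li></ul>')

upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
lower = upper.lower()

# XPath 1.0: translate() maps each uppercase letter to its lowercase one.
hits = doc.xpath(
    '//li[contains(translate(text(), $u, $l), "bass")]/text()',
    u=upper, l=lower)
print(hits)

# XPath 2.0: lower-case() is not available, so lxml raises an error.
rejected = False
try:
    doc.xpath('//li[lower-case(text()) = "brandon bass"]')
except etree.XPathEvalError as exc:
    rejected = True
    print('XPath 2.0 function rejected:', exc)
```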

Concurrently you can read the lxml docs, assuming you are using lxml.

I also found reading Stack Overflow XPath questions, such as this one, helpful.

Each time I encounter a new function or technique, I write a bit of demonstration code -- a minimal example -- showing (myself) how it works. That way, whenever I need to do XYZ again I can start from some runnable code.

unutbu