I've written a Python script that uses XPath to parse tabular data from a webpage. When run, it parses the data from the tables flawlessly. The only thing I can't work out is how to parse the table headers, that is, the th tags. With a CSS selector I could have used .cssselect("th,td"), but with XPath I'm stuck. Any help on how to also parse the data from the th tags would be highly appreciated.

Here is the script which is able to fetch everything from different tables except for the data within th tag:

import requests
from lxml.html import fromstring

response = requests.get("https://fantasy.premierleague.com/player-list/")
tree = fromstring(response.text)
for row in tree.xpath("//*[@class='ism-table']//tr"):
    tab_d = row.xpath('.//td/text()')
    print(tab_d)
SIM
  • What is desired output? Do you want to get th nodes along with td from each tr? – Andersson Dec 23 '17 at 22:27
  • Apologies in advance to both of the XPath giants who took the trouble to provide me with excellent solutions. It's hard to choose one solution over the other, so I'm accepting the answer I received first. – SIM Dec 24 '17 at 06:05

2 Answers

I'm not sure I get your point, but if you want to fetch both th and td nodes with a single XPath, you can try replacing

tab_d = row.xpath('.//td/text()')

with

tab_d = row.xpath('.//*[name()="th" or name()="td"]/text()')
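A minimal sketch of the name()-based test with each comparison spelled out, run against an inline sample table instead of the live page (the markup and cell values are made up for illustration). In XPath 1.0, an expression such as name()=("th" or "td") compares against a boolean and so matches every element, which is why the two comparisons are written separately here.

```python
from lxml.html import fromstring

# Hypothetical sample markup standing in for the real page.
HTML = """<html><body>
<table class="ism-table">
  <tr><th>Player</th><th>Team</th></tr>
  <tr><td>Player A</td><td>Team X</td></tr>
</table>
</body></html>"""

tree = fromstring(HTML)

rows = []
for row in tree.xpath("//*[@class='ism-table']//tr"):
    # Keep only elements whose name is exactly "th" or "td".
    rows.append(row.xpath('.//*[name()="th" or name()="td"]/text()'))

print(rows)
```

The header row and the data row come back in the same shape, so the first list can be treated as the column names.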
Andersson
  • @Andersson, a little clarity on how the term `name` appears in the XPath out of nowhere would be much appreciated. – SIM Dec 24 '17 at 06:09
  • The [`name()` function](https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/name) (or [`local-name()`](https://stackoverflow.com/questions/2462248/what-is-the-difference-between-name-and-local-name)) can be used to check the string representation of a node's name – Andersson Dec 24 '17 at 08:48

Change

.//td/text()

to

.//*[self::td or self::th]/text()

to include th elements too.

Note that it would be reasonable to assume that both td and th are immediate children of the tr context node, so you might further simplify your XPath to this:

*[self::td or self::th]/text()
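As a rough sketch of the simplified form, assuming the cells really are direct children of each tr, the expression can be applied to an inline sample table (the markup and values below are hypothetical, not taken from the real page):

```python
from lxml.html import fromstring

# Hypothetical sample markup; the real page is not fetched here.
HTML = """<html><body>
<table class="ism-table">
  <tr><th>Player</th><th>Team</th><th>Points</th></tr>
  <tr><td>Player A</td><td>Team X</td><td>10</td></tr>
</table>
</body></html>"""

tree = fromstring(HTML)

# Relative to each tr: any child element that is itself a td or a th.
cells_xpath = "*[self::td or self::th]/text()"

table_rows = [
    row.xpath(cells_xpath)
    for row in tree.xpath("//*[@class='ism-table']//tr")
]

print(table_rows)
```

The self:: axis tests the element being filtered, so the predicate keeps a child only when that child is a td or a th.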
kjhughes
  • Thanks sir kjhughes for your solution. It did the trick flawlessly as well. This is the first time I came across this term `self` within any xpath. I'm not sure I'll understand this style myself. +1 for your effective solution. – SIM Dec 24 '17 at 06:13