How to parse text from an HTML table element

Question

I'm currently writing a small test web scraper using the Python requests and lxml libraries. I'm trying to extract the text from the rows of a table on this site, using XPath to uniquely identify the table. Since the table itself can only be identified by its class name, and the class name isn't unique, I had to use the parent div element in order to specify the table. The table in question is the one that lists the dates of the season order, filming, and airing for the show Game of Thrones, which I'm trying to select with the following path:

tree.xpath('//div[@id = "mw-content-text"]//table[@class = "wikitable"]//text()')

For some reason, when I print the result of this path in the shell, I get an empty list. I expected it to display all of the text in the table, which I wanted as a check that I could actually get the contents; ultimately, though, I need to print each row of the table.

Is there something wrong with this XPath? If so, what is the correct way to go about printing the table contents?

score 2 · Accepted Answer · answered Jul 31 '16 at 20:33

The wikitable class is too broad to distinguish one table from another on a wiki page.
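To illustrate the point, here is a minimal sketch on a made-up HTML fragment (the markup is an assumption for illustration, not the real Wikipedia source). It shows that a class-based selector cannot single out one table, and that an exact `@class = "wikitable"` comparison skips any table whose class attribute lists additional classes, which is one possible cause of an unexpectedly short or empty result:

```python
from lxml.html import fromstring

# Hypothetical markup for illustration only; not copied from the real page.
SAMPLE = """
<html><body>
<div id="mw-content-text">
  <table class="wikitable"><tr><td>cast</td></tr></table>
  <table class="wikitable sortable"><tr><td>schedule</td></tr></table>
</div>
</body></html>
"""

root = fromstring(SAMPLE)

# Exact string comparison: matches only when the attribute is exactly "wikitable".
exact = root.xpath('//div[@id="mw-content-text"]//table[@class="wikitable"]')

# Token-style match: also finds tables that carry extra classes.
token = root.xpath(
    '//div[@id="mw-content-text"]'
    '//table[contains(concat(" ", normalize-space(@class), " "), " wikitable ")]'
)

print(len(exact))  # 1
print(len(token))  # 2
```

Neither selector says which table is wanted; both only describe how the table is styled.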

I would instead rely on the "Adaptation schedule" heading that precedes the table:

import requests
from lxml.html import fromstring

url = "https://en.wikipedia.org/wiki/Game_of_Thrones"
response = requests.get(url)
root = fromstring(response.content)

# Locate the table by the heading that precedes it rather than by its class.
table = root.xpath(".//h3[span = 'Adaptation schedule']/following-sibling::table")[0]
# Skip the header row, then print the text of each cell in every remaining row.
for row in table.xpath(".//tr")[1:]:
    print([cell.text_content() for cell in row.xpath(".//td")])

Prints:

['Season 1', 'March 2, 2010[52]', 'Second half of 2010', 'April 17, 2011', 'June 19, 2011', 'A Game of Thrones']
['Season 2', 'April 19, 2011[53]', 'Second half of 2011', 'April 1, 2012', 'June 3, 2012', 'A Clash of Kings and some early chapters from A Storm of Swords[54]']
['Season 3', 'April 10, 2012[55]', 'Second half of 2012', 'March 31, 2013', 'June 9, 2013', 'About the first two-thirds of A Storm of Swords[56][57]']
['Season 4', 'April 2, 2013[58]', 'Second half of 2013', 'April 6, 2014', 'June 15, 2014', 'The remaining one-third of A Storm of Swords and some elements from A Feast for Crows and A Dance with Dragons[59]']
['Season 5', 'April 8, 2014[60]', 'Second half of 2014', 'April 12, 2015', 'June 14, 2015', 'A Feast for Crows, A Dance with Dragons and original content,[61] with some late chapters from A Storm of Swords[62] and elements from The Winds of Winter[63][64]']
['Season 6', 'April 8, 2014[60]', 'Second half of 2015', 'April 24, 2016', 'June 26, 2016', 'Original content and outlined from The Winds of Winter,[65][66] with some late elements from A Feast for Crows and A Dance with Dragons[67]']
['Season 7', 'April 21, 2016[50]', 'Second half of 2016[49]', 'Mid-2017[5]', 'Mid-2017[5]', 'Original content and outlined from The Winds of Winter and A Dream of Spring[66]']
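The page may have changed since this was written, so the heading text or layout might no longer match. As a rough offline sketch of the same heading-anchored technique, the snippet below runs against a made-up fragment. The markup, the footnote marker, and the helper name `rows_after_heading` are all hypothetical. It uses `following-sibling::table[1]` to take the nearest table after the heading and returns an empty list instead of raising when the heading is not found:

```python
from lxml.html import fromstring

# Hypothetical fragment for illustration; the real page's markup may differ.
SAMPLE = """
<html><body>
  <h3><span>Cast</span></h3>
  <table class="wikitable">
    <tr><th>Actor</th><th>Role</th></tr>
    <tr><td>Actor A</td><td>Role A</td></tr>
  </table>
  <h3><span>Adaptation schedule</span></h3>
  <table class="wikitable">
    <tr><th>Season</th><th>Premiere</th></tr>
    <tr><td>Season 1</td><td>April 17, 2011<sup>[1]</sup></td></tr>
    <tr><td>Season 2</td><td>April 1, 2012</td></tr>
  </table>
</body></html>
"""


def rows_after_heading(root, heading):
    """Return the data rows of the first table after the <h3> whose <span> equals heading."""
    # $heading is an XPath variable; lxml binds it from the keyword argument.
    tables = root.xpath(
        ".//h3[span = $heading]/following-sibling::table[1]", heading=heading
    )
    if not tables:
        return []
    # Skip the header row and collect the stripped text of each data cell.
    return [
        [cell.text_content().strip() for cell in row.xpath("./td")]
        for row in tables[0].xpath(".//tr")[1:]
    ]


root = fromstring(SAMPLE)
rows = rows_after_heading(root, "Adaptation schedule")
for row in rows:
    print(row)
```

Passing the heading as an XPath variable avoids quoting problems if the label ever contains an apostrophe.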
