I am trying to scrape a website but ran into problems with the XPath expressions I was using on Scrapy's response objects.
From what I learned about XPath, I thought I was using the correct XPath expressions.
So I used a web browser to load the web page, then downloaded it and saved it as an HTML file.
Then I tried the XPath expressions two different ways.
The first way was to use Python's lxml.html module to open the file and parse it with an HTMLParser object.
The second way was to use Scrapy and point it to the saved HTML file.
In both cases, I used the same XPath expression, but I got different results.
The sample HTML code is something like this (not exactly, but I didn't want to post a huge chunk of code verbatim):
<html>
  <body>
    <div>
      <table type="games">
        <tbody>
          <tr row="1">
            <th data="week_number">1</th>
            <td data="date">"9/13/2020"</td>
          </tr>
        </tbody>
      </table>
    </div>
  </body>
</html>
For example, I'm trying to scrape the week number in the "TH" element under the "TR" element in the "TABLE".
I double-checked the content by using Chrome, instead of Firefox, to Inspect the file (Firefox adds "tbody" elements to tables, according to this post: Parsing HTML with XPath, Python and Scrapy). The <tbody> element is in the file, according to Chrome's Inspect.
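Since a browser's Inspect view shows the parsed DOM rather than the raw file, one way to take the <tbody> question out of the picture is an XPath expression that matches the rows with or without it. This is only a minimal sketch, assuming Python 3 and lxml; the two HTML strings and all of the names in it are made up for illustration and are not from my real page:

```python
from lxml import html

# Hypothetical snippets: the same table with and without a <tbody> wrapper
WITH_TBODY = """
<html><body><table type="games"><tbody>
<tr row="1"><th data="week_number">1</th></tr>
</tbody></table></body></html>
"""
WITHOUT_TBODY = """
<html><body><table type="games">
<tr row="1"><th data="week_number">1</th></tr>
</table></body></html>
"""

# The descendant axis (//tr) matches rows whether or not a <tbody> sits in between
ROW_XPATH = "//table[@type = 'games']//tr"

rows_with = html.document_fromstring(WITH_TBODY).xpath(ROW_XPATH)
rows_without = html.document_fromstring(WITHOUT_TBODY).xpath(ROW_XPATH)

print(len(rows_with), len(rows_without))
```

The trade-off is that //tr would also pick up rows of any table nested inside the games table, so it is looser than the /tbody/tr form.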
The first way was to open the HTML file using the lxml.html module:
import sys
from StringIO import StringIO  # Python 2; on Python 3 this lives in io

from lxml import html

if __name__ == '__main__':
    filename_04 = "/home/foo.html"

    # Try opening the file
    try:
        fh_04 = open(filename_04, "r")
    except IOError:
        print "Error opening %s. Exiting" % filename_04
        sys.exit(1)

    # Try reading the contents of the HTML file as UTF-8,
    # then close the file
    try:
        content_04 = fh_04.read().decode('utf-8')
    except UnicodeDecodeError:
        print "Error trying to read as UTF-8. Exiting."
        sys.exit(1)
    fh_04.close()

    # Define an HTML parser object
    parser_04 = html.HTMLParser()

    # Build an element tree from the file contents using parser_04
    tree_04 = html.parse(StringIO(content_04), parser_04)

    # Get all the <tr> elements from the <table type="games">
    game_elements_list = tree_04.xpath("//table[@type = 'games']/tbody/tr")

    # Now loop through each of the <tr> element objects
    for game_element in game_elements_list:
        # Parse the week number using xpath()
        # *** NOTE: this expression returns a list
        parsed_week_number = game_element.xpath(".//th[@data = 'week_number']/text()")
        print ":: parsed_week_number: ", str(parsed_week_number)
        p_type = type(parsed_week_number)
        print ":: p_type: ", str(p_type)
Using the XPath expressions via the lxml.html module returns this output:
:: parsed_week_number: ['1']
:: p_type: <type 'list'>
This is what I expect, so I believe my XPath expressions are correct.
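For anyone who wants to reproduce the lxml result without my file, here is a minimal, self-contained sketch. It assumes Python 3 and lxml (my script above is Python 2), and it inlines the sample HTML as a string; SAMPLE_HTML and the variable names are just for illustration:

```python
from lxml import html

# The simplified sample table from this question, inlined as a string
SAMPLE_HTML = """
<html><body><div>
<table type="games">
<tbody>
<tr row="1">
<th data="week_number">1</th>
<td data="date">"9/13/2020"</td>
</tr>
</tbody>
</table>
</div></body></html>
"""

tree = html.document_fromstring(SAMPLE_HTML)

# Same two XPath expressions as in the script above
rows = tree.xpath("//table[@type = 'games']/tbody/tr")
week_numbers = []
for row in rows:
    # With lxml, a text() query returns a plain list of strings
    parsed_week_number = row.xpath(".//th[@data = 'week_number']/text()")
    print(":: parsed_week_number:", parsed_week_number)
    print(":: p_type:", type(parsed_week_number))
    week_numbers.extend(parsed_week_number)
```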
However, when I point the Scrapy spider to the local file, I get different results:
# I'm only posting the callback method, not the
# method that makes the actual request, because
# the request() call works
def parse_schedule_page(self, response):
    # The XPath expression is the same as the one used in the
    # script that uses the lxml.html module
    game_elements_list = response.xpath("//table[@type = 'games']/tbody/tr")
    for game_element in game_elements_list:
        # Again, the XPath expression is the same as the one
        # used in the script that uses the lxml.html module
        parsed_week_number = game_element.xpath(".//th[@data = 'week_number']/text()")
        self.log(":: parsed_week_number: " + str(parsed_week_number))
        self.log("p_type: " + str(type(parsed_week_number)))
        # To get the week number, I have to add the following line:
        # week_number = parsed_week_number.extract()
But in the case of the Spider, the output is different:
2020-07-17 21:22:30 [test_schedule] DEBUG: :: parsed_week_number: [<Selector xpath=".//th[@data-stat = 'week_num']/text()" data=u'1'>]
2020-07-17 21:22:30 [test_schedule] DEBUG: p_type: <class 'scrapy.selector.unified.SelectorList'>
The same XPath expression doesn't return the bare contents of <th data="week_number">1</th>. Instead, it returns a SelectorList whose Selector holds that text in its data attribute.
I know Scrapy uses a different extractor method than lxml's HTMLParser. But no matter how the HTML data is stored, shouldn't XPath expressions work the same even if the extractor methods were different?
Does Scrapy's response.xpath() method evaluate XPath expressions differently than lxml.html's xpath() method?