
I am trying to scrape a website, but I ran into problems with the XPath expressions I was using on Scrapy's response objects.

From what I learned about XPath, I thought I was using the correct XPath expressions.

So I used a web browser to load the web page, then downloaded it and saved it as an HTML file.

Then I tried the XPath expressions two different ways.

The first way was to use Python's lxml.html module to open the file and load it as an HTMLParser object.

The second way was to use Scrapy and point it to the saved HTML file.

In both cases, I used the same XPath expression. But I get different results.

The sample HTML code is something like this (not exactly but I didn't want to post a huge chunk of code verbatim):

<html>
  <body>
    <div>
      <table type="games">
        <tbody>
          <tr row="1">
            <th data="week_number">1</th>
            <td data="date">"9/13/2020"</td>
          </tr>
        </tbody>
       </table>
     </div>
   </body>
</html>

For example, I'm trying to scrape the week number in the <th> element under the <tr> element in the table.

I double-checked the content by using Chrome, instead of Firefox, to inspect the file (Firefox adds "tbody" elements to tables, according to this post: Parsing HTML with XPath, Python and Scrapy).

The <tbody> element is in the file, according to Chrome's Inspect.

The first way was to open the HTML file using the lxml.html module:

import sys
from StringIO import StringIO

from lxml import etree, html

if __name__ == '__main__':

    filename_04 = "/home/foo.html"

    # Try opening the filename
    try:
        fh_04 = open(filename_04, "r")
    except IOError:
        print "Error opening %s.  Exiting" % filename_04
        sys.exit(1)

    # Try reading the contents of the HTML file.
    # Then close the file
    try:
        content_04 = fh_04.read().decode('utf-8')
    except UnicodeDecodeError:
        print "Error trying to read as UTF-8. Exiting."
        sys.exit(1)

    fh_04.close()

    # Define an HTML parser object
    parser_04 = html.HTMLParser()

    # Create a logical XML tree from the contents of parser_04
    tree_04 = html.parse(StringIO(content_04), parser_04)

    # Get all the <TR> elements from the <table type="games">
    game_elements_list = tree_04.xpath("//table[@type = 'games']/tbody/tr")

    num_games = len(game_elements_list)
    # Now loop through each of the <TR> element objects of game_elements_list
    for x in range(num_games):
        # Parse the week number using xpath()
        # *** NOTE: this expression returns a list
        parsed_week_number = game_elements_list[x].xpath(".//th[@data = 'week_number']/text()")
                                                   
        print ":: parsed_week_number: ", str(parsed_week_number)
        p_type = type(parsed_week_number)
        print ":: p_type: ", str(p_type)

Using the XPath expressions via the lxml.html module returns this output:

:: parsed_week_number:  ['1']
:: p_type:  <type 'list'>

This is what I expected, so I concluded my XPath expressions are correct.
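For reference, the lxml behaviour can be reproduced in a few self-contained lines (a minimal sketch in Python 3, using the sample HTML from above):

```python
from lxml import html

SAMPLE = """
<html><body><div>
  <table type="games">
    <tbody>
      <tr row="1">
        <th data="week_number">1</th>
        <td data="date">"9/13/2020"</td>
      </tr>
    </tbody>
  </table>
</div></body></html>
"""

tree = html.fromstring(SAMPLE)
# For a text() query, lxml's xpath() returns a plain list of strings
weeks = tree.xpath("//table[@type = 'games']/tbody/tr/th[@data = 'week_number']/text()")
print(weeks)  # ['1']
```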

However, when I point the Scrapy spider to the local file, I get different results:

    # I'm only posting the callback method, not the
    # method that makes the actual request, because
    # the request() call works
    def parse_schedule_page(self, response):

        # The XPath expression is the same as the one
        # used in the lxml.html version
        game_elements_list = response.xpath("//table[@type = 'games']/tbody/tr")
        num_game_elements = len(game_elements_list)

        for i in range(num_game_elements):
            # Again, the XPath expression is the same
            # as the one used in the file that 
            # uses the lxml.html module
            parsed_week_number = game_elements_list[i].xpath(".//th[@data = 'week_number']/text()")
            stmt = ":: parsed_week_number: " + str(parsed_week_number)
            self.log(stmt)
            p_type = type(parsed_week_number)
            stmt = "p_type: " + str(p_type)
            self.log(stmt)

            """
            To get the week number, I have to add the following line:
            week_number = parsed_week_number.extract()
            """

But in the case of the Spider, the output is different:

2020-07-17 21:22:30 [test_schedule] DEBUG: :: parsed_week_number: [<Selector xpath=".//th[@data = 'week_number']/text()" data=u'1'>]
2020-07-17 21:22:30 [test_schedule] DEBUG: p_type: <class 'scrapy.selector.unified.SelectorList'>

The same XPath expression doesn't return the text content of <th data="week_number">1</th>; it returns a list of Selector objects instead.

I know Scrapy uses a different extraction mechanism than lxml's HTMLParser. But no matter how the HTML data is stored, shouldn't the same XPath expression select the same content, even if the extraction mechanisms differ?

Does Scrapy's response.xpath() method evaluate XPath expressions differently than lxml.html's xpath() method?

SQA777

2 Answers


To answer your question: Scrapy uses lxml internally, and the XML Path Language is standardised (albeit it hasn't been updated in a while), so your XPath expressions should behave the same in both.

To help further, a URL for the specific XPath selector you're struggling with would be useful.

Tips

As a general rule, if I can't get an XPath selector to work when running the script, I go into the scrapy shell and work it out there. Generally speaking, I work in the scrapy shell with a list of the data I want and try out the XPath there to confirm it will be picked up, before writing my spiders.

Additional Information

For more information on XPath, see here

It's worth looking at the Scrapy codebase if you have questions like this about the internals, even if you don't think you'll understand a lot of it.

The Scrapy docs here reference the response.xpath method, and you can also get to the implementation by clicking the source link.

Below is the relevant codebase for the xpath method including the imports.

response.xpath imports

"""
XPath selectors based on lxml
"""

import sys

import six
from lxml import etree, html

response.xpath method

def xpath(self, query, namespaces=None, **kwargs):
        """
        Find nodes matching the xpath ``query`` and return the result as a
        :class:`SelectorList` instance with all elements flattened. List
        elements implement :class:`Selector` interface too.

        ``query`` is a string containing the XPATH query to apply.

        ``namespaces`` is an optional ``prefix: namespace-uri`` mapping (dict)
        for additional prefixes to those registered with ``register_namespace(prefix, uri)``.
        Contrary to ``register_namespace()``, these prefixes are not
        saved for future calls.

        Any additional named arguments can be used to pass values for XPath
        variables in the XPath expression, e.g.::

            selector.xpath('//a[href=$url]', url="http://www.example.com")
        """

        try:
            xpathev = self.root.xpath
        except AttributeError:
            return self.selectorlist_cls([])

        nsp = dict(self.namespaces)
        if namespaces is not None:
            nsp.update(namespaces)
        try:
            result = xpathev(query, namespaces=nsp,
                             smart_strings=self._lxml_smart_strings,
                             **kwargs)
        except etree.XPathError as exc:
            msg = u"XPath error: %s in %s" % (exc, query)
            msg = msg if six.PY3 else msg.encode('unicode_escape')
            six.reraise(ValueError, ValueError(msg), sys.exc_info()[2])

        if type(result) is not list:
            result = [result]

        result = [self.__class__(root=x, _expr=query,
                                 namespaces=self.namespaces,
                                 type=self.type)
                  for x in result]
        return self.selectorlist_cls(result)
AaronS

AaronS' answer is very complete and thorough, but I believe he missed the actual problem in your code. It's a simple mistake that goes by unnoticed.

According to your logs:

2020-07-17 21:22:30 [test_schedule] DEBUG: :: parsed_week_number: [<Selector xpath=".//th[@data = 'week_number']/text()" data=u'1'>]
2020-07-17 21:22:30 [test_schedule] DEBUG: p_type: <class 'scrapy.selector.unified.SelectorList'>

You can see in the first line that the value of parsed_week_number is a list with one Selector object, and that this object even has a data attribute with the value 1. So your XPath expression is selecting the correct node; however, to extract the data it selected you need to use the .get() or .getall() methods.

The .get() will return the data of the first selector in the list (in your case the list has only one) as a string, while .getall() will return the data of all the selectors in the list as a list of strings. You can read more about those methods here.

Effectively you need to correct this line:

        parsed_week_number = game_elements_list[i].xpath(".//th[@data = 'week_number']/text()")

To this:

        parsed_week_number = game_elements_list[i].xpath(".//th[@data = 'week_number']/text()").get()
renatodvc
  • I know how to fix the problem. I could get to the week number by issuing this line: `week_number = parsed_week_number.extract()[0]`. And I got around having to deal with extract() or get() by converting the response.body to an html object. However, the gist of my question is why an XPath expression returns different results from a Scrapy response than from an html object. I know Scrapy uses a different type of selector but I thought XPath cut through all of that. – SQA777 Jul 23 '20 at 01:51
  • Happy to have my answer rescinded and renatodvc's to be accepted. I think when I was posting I thought it would've been helpful to have a link to make sure the XPath selectors were indeed correct. Renatodvc provides great advice, advice I seem to give in a lot of posts lately. It would still be useful to know what the URL is, to see the differences between the HTML and the response Scrapy gives. Sometimes this is due to what Scrapy can actually parse; JavaScript-oriented websites will not be completely and accurately parsed. – AaronS Jul 23 '20 at 06:28
  • Very kind of you @AaronS, but there is no need. I think your answer undoubtedly contributes more for those who end up in this question in the future! I'm always happy to see your thoughtful and complete answers here in SO. :) – renatodvc Jul 23 '20 at 13:21