XPath in scrapy returns elements which don't exist

Question

I am writing a new Scrapy spider and everything is going well, except for one website where response.xpath returns list items that do not exist in the HTML code:

{"pdf_name": ["\n\t\t\t\t\t\t\t\t\t", "ZZZZZZ", "\n\t\t\t\t\t\t\t\t\t", "PDF", "\n\t\t\t\t\t\t\t\t"]},
{"pdf_name": ["\n\t\t\t\t\t\t\t\t\t\t", "YYYYYY", "\n\t\t\t\t\t\t\t\t\t\t", "XXXXXX"]}

As you can see below, these "empty" items (\t and \n) are not inside any HTML tags. If I understand correctly, XPath is including the whitespace that precedes the tags:

<div class="inner d-i-b va-t" role="group">
                        <a class="link-to" href="A.pdf" target="_blank">
                                    <i class="offscreen">ZZZZZZ</i>
                                    <span>PDF</span>
                                </a>

                                <div class="text-box">
                                    <a href="A.pdf">
                                        <i class="offscreen">YYYYYY</i>
                                        <p>XXXXXX</p></a>
                                </div>
                            </div>

I know that I can strip() the strings to remove the whitespace, but that would only mitigate the issue rather than fix the underlying problem, which is that whitespace is included in the results.

Why is this happening? How can I limit the XPath results to tags only (I previously thought that was the default)?

Spider code - parse function (pdf_name is causing problems)

def parse(self, response):

    # Select all links to pdfs
    for pdf in response.xpath('//a[contains(@href, ".pdf")]'):
        item = PdfItem()

        # Create a list of text fields for links to PDFs and their descendants
        item['pdf_name'] = pdf.xpath('descendant::text()').extract()

        yield item

Since the output is in JSON format, you are seeing the \t and \n escapes. If you load them into a DB you will get the actual whitespace. — backtrack, Sep 19 '16 at 09:23
Thanks @Backtrack for the info. The thing is that I don't want any whitespace, \t or \n; it simply shouldn't be included in the results. I am looking for the text inside tags, not the formatting outside of them. Any ideas how to improve this? — Starid, Sep 19 '16 at 09:25
Here is an example: http://stackoverflow.com/questions/5992177/what-is-the-difference-between-normalize-space-and-normalize-spacetext — backtrack, Sep 19 '16 at 09:28

Tomalak · Accepted Answer · 2016-09-19T09:40:11.433

Whitespace is part of the document. Just because you think it is unimportant does not make it go away.

A text node is a text node; whether it consists of ' ' (the space character) or any other character makes no difference at all.
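To make this concrete, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree rather than Scrapy's selectors (an assumption, chosen only so that it runs without Scrapy installed). The markup is a shortened copy of the first link from the question, and the names SNIPPET, text_nodes and visible are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the first link in the question (illustrative only).
SNIPPET = (
    '<a class="link-to" href="A.pdf" target="_blank">\n'
    '\t<i class="offscreen">ZZZZZZ</i>\n'
    '\t<span>PDF</span>\n'
    '</a>'
)

link = ET.fromstring(SNIPPET)

# itertext() yields every piece of text under the element in document order,
# including the indentation and line breaks between the child tags.
text_nodes = list(link.itertext())
print(text_nodes)  # ['\n\t', 'ZZZZZZ', '\n\t', 'PDF', '\n']

# Dropping whitespace-only entries after the fact keeps one entry per tag text.
visible = [text for text in text_nodes if text.strip()]
print(visible)  # ['ZZZZZZ', 'PDF']
```

The five entries have the same shape as the first pdf_name list in the question: the line breaks and tabs between the tags are text nodes in their own right, so a query for all descendant text returns them too.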

You can normalize the whitespace with the normalize-space() XPath function:

def parse(self, response):
    for pdf_link in response.xpath('//a[contains(@href, ".pdf")]'):
        item = PdfItem()
        item['pdf_name'] = pdf_link.xpath('normalize-space(.)').extract()
        yield item

First, normalize-space() converts its argument to a string, which for an element is done by concatenating all of its descendant text nodes. Then it trims leading and trailing whitespace and collapses any run of consecutive whitespace (including line breaks) into a single space. For example, '\n bla \n\n bla ' would become 'bla bla'.
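The two steps can be imitated in plain Python to see what comes out. This is only a rough sketch, not how Scrapy or lxml implement the function: normalize_space is a hypothetical helper, the markup is a shortened copy of the second link from the question, and the standard-library ElementTree stands in for Scrapy's selectors so that the snippet runs on its own.

```python
import re
import xml.etree.ElementTree as ET


def normalize_space(value: str) -> str:
    """Hypothetical helper imitating XPath 1.0 normalize-space() on a string."""
    # XPath 1.0 counts only space, tab, carriage return and line feed as whitespace.
    collapsed = re.sub(r'[ \t\r\n]+', ' ', value)
    # After collapsing, any leading or trailing whitespace is a single space.
    return collapsed.strip(' ')


# The example string from the answer.
print(normalize_space('\n bla \n\n bla '))  # bla bla

# Shortened copy of the second link in the question (illustrative only).
link = ET.fromstring(
    '<a href="A.pdf">\n'
    '\t<i class="offscreen">YYYYYY</i>\n'
    '\t<p>XXXXXX</p></a>'
)

# Step 1: the string value of an element is all of its descendant text joined together.
string_value = ''.join(link.itertext())

# Step 2: trim and collapse the whitespace.
pdf_name = normalize_space(string_value)
print(pdf_name)  # YYYYYY XXXXXX
```

Note that this gives one string per link rather than a list of separate texts. If one entry per text node is wanted instead, an XPath predicate such as descendant::text()[normalize-space()] should keep only the text nodes that contain something other than whitespace.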

Thank you for your valuable comment and answer! — Starid, Sep 19 '16 at 18:22
