I am creating a new Scrapy spider and everything is going well, except for one website where response.xpath returns list items that do not appear in the HTML code:

{"pdf_name": ["\n\t\t\t\t\t\t\t\t\t", "ZZZZZZ", "\n\t\t\t\t\t\t\t\t\t", "PDF", "\n\t\t\t\t\t\t\t\t"]},
{"pdf_name": ["\n\t\t\t\t\t\t\t\t\t\t", "YYYYYY", "\n\t\t\t\t\t\t\t\t\t\t", "XXXXXX"]}

As you can see below, these "empty" items (\t and \n) are not inside any HTML tags. If I understand correctly, XPath is including the whitespace that precedes the tags:

<div class="inner d-i-b va-t" role="group">
                        <a class="link-to" href="A.pdf" target="_blank">
                                    <i class="offscreen">ZZZZZZ</i>
                                    <span>PDF</span>
                                </a>

                                <div class="text-box">
                                    <a href="A.pdf">
                                        <i class="offscreen">YYYYYY</i>
                                        <p>XXXXXX</p></a>
                                </div>
                            </div>

I know that I can strip() the strings to remove the whitespace, but that would only mitigate the issue rather than fix the underlying problem, which is that whitespace is included in the results.

Why is this happening? How can I limit the XPath results to the text inside tags only (I previously thought that was the default)?

Spider code - parse function (pdf_name is causing problems)

def parse(self, response):

    # Select all links to pdfs
    for pdf in response.xpath('//a[contains(@href, ".pdf")]'):
        item = PdfItem()

        # Create a list of text fields for links to PDFs and their descendants
        item['pdf_name'] = pdf.xpath('descendant::text()').extract()

        yield item
Starid
  • Since the output is in JSON format, you are seeing the \t and \n escapes. If you load them into a DB, you will have the actual whitespace. – backtrack Sep 19 '16 at 09:23
  • Thanks for the info, @Backtrack. The thing is that I don't want whitespace, \t or \n at all; it simply shouldn't be included in the results. I am looking for the text in tags, not the formatting outside of them. Any ideas how to improve this? – Starid Sep 19 '16 at 09:25
  • Here is an example: http://stackoverflow.com/questions/5992177/what-is-the-difference-between-normalize-space-and-normalize-spacetext – backtrack Sep 19 '16 at 09:28

1 Answer

Whitespace is part of the document. Just because you think it is unimportant does not make it go away.

A text node is a text node; whether it consists of ' ' (the space character) or any other characters makes no difference at all.
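To see those whitespace-only text nodes directly, here is a minimal sketch that uses only the Python standard library rather than Scrapy. The `snippet` string is a simplified, hypothetical copy of the first link from the question, and it is parsed as XML, which is an assumption that happens to hold for this small fragment.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical copy of the first link from the question.
# The line breaks and tabs between the tags are part of the document.
snippet = (
    '<a class="link-to" href="A.pdf">\n'
    '\t<i class="offscreen">ZZZZZZ</i>\n'
    '\t<span>PDF</span>\n'
    '</a>'
)
link = ET.fromstring(snippet)

# itertext() yields every piece of text under the element in document
# order, including the whitespace that sits between the child tags.
text_nodes = list(link.itertext())
print(text_nodes)

# Dropping the whitespace-only pieces leaves just the visible labels.
non_blank = [t for t in text_nodes if t.strip()]
print(non_blank)
```

The first list has five entries, three of which are pure whitespace, which mirrors the shape of the output shown in the question. In XPath itself, a predicate such as `descendant::text()[normalize-space()]` should filter out the whitespace-only nodes in a similar way.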

You can normalize the whitespace with the normalize-space() XPath function:

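def parse(self, response):
    # Select all links that point to PDFs
    for pdf_link in response.xpath('//a[contains(@href, ".pdf")]'):
        item = PdfItem()
        # normalize-space(.) yields one cleaned-up string per link;
        # extract() still returns it wrapped in a list
        item['pdf_name'] = pdf_link.xpath('normalize-space(.)').extract()
        yield item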

First, normalize-space() converts its argument to a string, which for an element means concatenating all of its descendant text nodes. Then it trims leading and trailing whitespace and collapses any run of consecutive whitespace (including line breaks) into a single space. For example, '\n bla \n\n bla ' would become 'bla bla'.
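The two steps described above can be imitated in plain Python. This is only an illustrative sketch of the behaviour, not the XPath function itself; the helper name `normalize_space` is made up for this example, and it treats space, tab, carriage return and line feed as whitespace.

```python
import re


def normalize_space(value: str) -> str:
    """Rough Python imitation of XPath's normalize-space() (hypothetical helper)."""
    # Collapse every run of space, tab, CR or LF into a single space...
    collapsed = re.sub(r'[ \t\r\n]+', ' ', value)
    # ...then trim the single space that may remain at either end.
    return collapsed.strip(' ')


# The example string from the answer
simple = normalize_space('\n bla \n\n bla ')
print(simple)

# The text nodes of the first link from the question, concatenated the
# way the string value of an element is built
nodes = ["\n\t\t\t\t\t\t\t\t\t", "ZZZZZZ", "\n\t\t\t\t\t\t\t\t\t", "PDF", "\n\t\t\t\t\t\t\t\t"]
link_text = normalize_space("".join(nodes))
print(link_text)
```

With normalize-space(.) in the spider, each link should therefore produce a single string such as 'ZZZZZZ PDF' instead of a list padded with whitespace entries.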

Tomalak