I am creating a new Scrapy spider and everything is going pretty well, but I have a problem with one of the websites: response.xpath returns list items that do not appear in the HTML code:
{"pdf_name": ["\n\t\t\t\t\t\t\t\t\t", "ZZZZZZ", "\n\t\t\t\t\t\t\t\t\t", "PDF", "\n\t\t\t\t\t\t\t\t"]},
{"pdf_name": ["\n\t\t\t\t\t\t\t\t\t\t", "YYYYYY", "\n\t\t\t\t\t\t\t\t\t\t", "XXXXXX"]}
As you can see below, these "empty" items (\t and \n) are not inside any HTML tags. If I understand correctly, XPath is including the whitespace between tags:
<div class="inner d-i-b va-t" role="group">
    <a class="link-to" href="A.pdf" target="_blank">
        <i class="offscreen">ZZZZZZ</i>
        <span>PDF</span>
    </a>
    <div class="text-box">
        <a href="A.pdf">
            <i class="offscreen">YYYYYY</i>
            <p>XXXXXX</p></a>
    </div>
</div>
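The same effect can be reproduced outside Scrapy. Below is a minimal sketch that uses Python's standard-library html.parser instead of Scrapy's selectors (a deliberate substitution, only for illustration). The markup is a shortened copy of the first link, its indentation is assumed, and TextCollector is a hypothetical helper name. It suggests that the line breaks and tabs between tags are ordinary character data in the document, which is probably why text() returns them.

```python
from html.parser import HTMLParser

# Shortened copy of the first link above; the exact indentation is assumed.
HTML = (
    '<a class="link-to" href="A.pdf" target="_blank">\n'
    '\t<i class="offscreen">ZZZZZZ</i>\n'
    '\t<span>PDF</span>\n'
    '</a>'
)


class TextCollector(HTMLParser):
    """Hypothetical helper: records every run of character data it sees."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for all text between tags, including whitespace-only runs.
        self.chunks.append(data)


collector = TextCollector()
collector.feed(HTML)
collector.close()

# repr() makes the whitespace-only entries visible.
for chunk in collector.chunks:
    print(repr(chunk))
```

The whitespace-only entries appear before, between and after 'ZZZZZZ' and 'PDF', the same pattern as in the first pdf_name list.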
I know that I can strip() the strings and drop the empty ones, but that would only work around the issue rather than fix the underlying problem, which is that whitespace is included in the results.
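For reference, the strip() workaround would look roughly like this. It is a sketch applied to the first result shown earlier; the variable names raw and cleaned are made up for the example.

```python
# First pdf_name list from the output above.
raw = [
    "\n\t\t\t\t\t\t\t\t\t",
    "ZZZZZZ",
    "\n\t\t\t\t\t\t\t\t\t",
    "PDF",
    "\n\t\t\t\t\t\t\t\t",
]

# Strip each string and keep only those that still contain something.
cleaned = [text.strip() for text in raw if text.strip()]

print(cleaned)  # ['ZZZZZZ', 'PDF']
```

The same filtering could probably be pushed into the query itself with an XPath 1.0 predicate such as descendant::text()[normalize-space()], which keeps only text nodes that are not whitespace-only. I have not verified that against this site.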
Why is this happening? How can I limit the XPath results to the text inside tags (I previously thought that was the default)?
Spider code, parse function (pdf_name is the field causing problems):
def parse(self, response):
    # Select all links to pdfs
    for pdf in response.xpath('//a[contains(@href, ".pdf")]'):
        item = PdfItem()
        # Create a list of text fields for links to PDFs and their descendants
        item['pdf_name'] = pdf.xpath('descendant::text()').extract()
        yield item