Scrapy 'normalize-space()' is truncating the whole string

Question

I am scraping an XML document like this:

>>> response.xpath("//ul[@class='meta-info d-flex flex-wrap align-items-center list-unstyled justify-content-around']/li[position()=2]/text()").extract()

and it gives me the following output:

['\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t23 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ', '\n            ', '\n\t\t\t                ', '\n\t\t\t\t24 Feb, 2019        ']

But I do not want any entries that consist only of newlines, tabs, or spaces, so I am trying to use the normalize-space() function, as follows:

>>> response.xpath("normalize-space(//ul[@class='meta-info d-flex flex-wrap align-items-center list-unstyled justify-content-around']/li[position()=2]/text())").extract()

But I am getting a single empty string as output:

['']

What is happening here?

score 1 · Accepted Answer · answered Jan 25 '19 at 08:10

I used a regex to solve a similar problem; I have included it below in case you want to test it, and I found that it works well. As for what is happening with normalize-space(): it is expected to return an empty string when the text node it ends up evaluating contains only whitespace.

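import re
# Grab every text node of the second <li> in each matching <ul>.
item_text = response.xpath("//ul[@class='meta-info d-flex flex-wrap align-items-center list-unstyled justify-content-around']/li[position()=2]/text()").extract()
# Join the pieces, trim the ends, and replace each run of 2+ whitespace characters with one newline.
re.sub('[\s]{2,}', '\n', "".join(item_text).strip())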
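
For anyone who wants to try the cleanup without a live response, here is a minimal sketch. The `item_text` list is a hypothetical sample shaped like the output in the question, not real scraped data. It shows the same substitution written with a raw-string pattern, plus a regex-free alternative that keeps one entry per date.

```python
import re

# Hypothetical sample mirroring the shape of the extracted list in the question.
item_text = [
    "\n            ",
    "\n\t\t\t                ",
    "\n\t\t\t\t24 Feb, 2019        ",
    "\n            ",
    "\n\t\t\t                ",
    "\n\t\t\t\t23 Feb, 2019        ",
]

# Raw-string pattern: each run of two or more whitespace characters becomes one newline.
collapsed = re.sub(r"\s{2,}", "\n", "".join(item_text).strip())

# Regex-free alternative: collapse whitespace inside each piece, then drop the empty ones.
cleaned = [" ".join(piece.split()) for piece in item_text]
dates = [piece for piece in cleaned if piece]

print(repr(collapsed))
print(dates)
```

The `" ".join(piece.split())` idiom is roughly the Python counterpart of XPath's normalize-space(): it trims both ends and collapses inner whitespace runs to a single space.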

Matts

Just FYI, `[\s]{2,}` is equivalent to `\s{2,}`; the character class is unnecessary. But you should always use a raw string for a regex: `r'\s{2,}'`. – Tomalak Jan 25 '19 at 10:17
Thanks. I did know about raw strings; it's not something I've always applied, but I will from now on. – Matts Jan 25 '19 at 12:12

score 1 · Answer 2 · answered Jan 25 '19 at 08:18

normalize-space() works on a single string. You are giving it a whole list of nodes.

So it takes the first one, converts that to a string, and returns the normalized result. Your first node has a value of '\n ', which is whitespace only, so the result is an empty string.

Write a for loop over //ul[@class='meta-info d-flex flex-wrap align-items-center list-unstyled justify-content-around']/li[position()=2] and call normalize-space() on the individual nodes.
