4

Below is a mock-up of a document I'm working on:

<div>
<h4>Area</h4>
  <span class="aclass"> </span>
  <span class="bclass">
        <strong>Address:</strong>
  10 Downing Street

  London

  SW1
  </span>
</div>

I'm getting the address like this:

response.xpath(u".//h4[. = 'Area']/following-sibling::span[contains(.,'Address:')]/text()").extract()

which returns

[u'\r\n  \t', u'\r\n  10 Downing Street\r\n\r\n  London     \r\n  \r\n  SW1\r\n  ']

I'm trying to clean that up with normalize-space. I've tried putting it in every location I could think of, but it either tells me there's a syntax error, or returns an empty string.

Updating to add that I'm trying to get this working without changing the selector too much. I have similar cases which don't have the <strong> tag, for example. The selector is overcomplicated in the example I've prepared here, but in the live version, I have to take that rather convoluted route to get to the address.

Regarding the possible duplicate Following the advice in the possible duplicate, I added /normalize-space(.) giving this:

(u".//h4[. = 'Area']/following-sibling::span[contains(.,'Address:')]/text()/normalize-space(.)").extract()

That produces a ValueError: Invalid XPath: error.

user3185563
  • 1,314
  • 2
  • 15
  • 22
  • Regarding the duplicate question reference: `.//h4[. = 'Area']/following-sibling::span[contains(.,'Address:')]/text()/normalize-space(.)` is valid in XPath 2, but **not in XPath 1.0** (which scrapy supports only, on top of lxml/libxml2). Citing the accepted answer [there](https://stackoverflow.com/questions/3359512/is-it-possible-to-apply-normalize-space-to-all-nodes-xpath-expression-finds): _"In XPath 2.0 a location step of an XPath expression may be a function reference"_. This is not possible with XPath 1.0 – paul trmbrth Nov 25 '15 at 13:16
  • Another option is to use `normalize-space()` or `string()` on the `` with the address and use regular expression chaining `.re(r)` with `r=re.compile(r'Address:(.*)', re.S)` or similar. `selector.xpath('.//h4[.="Area"]/following-sibling::span[starts-with(normalize-space(), "Address")]').xpath('string()').re(r)` would give you `[u'\n 10 Downing Street\n\n London\n\n SW1\n ']` (I use `string()` because newlines can be important, and `normalize-space()` will replace them with space) – paul trmbrth Nov 25 '15 at 13:17

3 Answers3

4

You can locate the strong element, get the following text sibling and normalize it:

In [1]: response.xpath(u"normalize-space(.//strong[. = 'Address:']/following-sibling::text())").extract()
Out[1]: [u'10 Downing Street London SW1']

Alternatively, you can look into Item Loaders and input and output processors. I often use Join(), TakeFirst() and MapCompose(unicode.strip) for cleaning up the extracted data from extra newlines or spaces.

alecxe
  • 462,703
  • 120
  • 1,088
  • 1,195
  • 2
    I feel that, depending on what you're doing with the data, Item Loaders would be the way to go. This is one of the main thing it's designed to do - data sanitation/formatting. – Rejected Nov 25 '15 at 16:55
3
"normalize-space(//strong[contains(text(), 'Address:')]/following-sibling::node())"
eLRuLL
  • 18,488
  • 9
  • 73
  • 99
  • The original version of you answer was similar to this: (u"normalize-space(//h4[. = 'Area']/following-sibling::span[contains(.,'Address:')])").extract() That actually seems to work. I was just wondering if you saw a particular problem with that. Is there any reason not to use it? – user3185563 Nov 25 '15 at 12:44
1

Since you're using Scrapy you can simplify your XPath using a Python one-liner:

" ".join(s.split()) # where `s` is your string

Using the above you can omit normalize-space from your XPath expression and, instead, create a reusable sanitization function using Scrapy Input Processors like so:

import scrapy
from scrapy.loader.processors import MapCompose
from w3lib.html import remove_tags

def normalize_space(value):
    return " ".join(value.split())

class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags, normalize_space),
    )

Alternatively you could also use the Python expression inside a Scrapy Item Loader like this:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Compose

class ProductLoader(ItemLoader):
    name_in = Compose(lambda s: " ".join(s.split()))

Credit for the one-liner goes to Tom's answer in a related question.

vhs
  • 9,316
  • 3
  • 66
  • 70