I would like to select all of the text inside a link, including the text of its child elements:

Bold normal Italist

The HTML is:

<a href=""><strong>Bold</strong> normal <i>Italist</i></a>

However, a/text() yields only

normal

Does anyone know a fix? I'm testing crawling Bing, and the bold text is in a different position depending on the query.

GRS
    You need to understand [**the difference between text nodes and string values in XPath**](https://stackoverflow.com/a/41077106/290085) – kjhughes Jun 02 '17 at 16:20
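
To illustrate the distinction the comment points at, here is a minimal sketch using only Python's standard-library ElementTree rather than XPath itself (that substitution is my assumption, chosen so the example runs without Scrapy). The names `direct_text` and `string_value` are made up for the illustration.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<a href=""><strong>Bold</strong> normal <i>Italist</i></a>')

# Text nodes that are direct children of <a> (what a/text() selects).
# ElementTree stores them as a.text plus the .tail of each child element.
direct_text = [t for t in [a.text] + [child.tail for child in a] if t]

# String value of <a>: all descendant text nodes concatenated in document order.
string_value = "".join(a.itertext())

print(direct_text)   # [' normal ']
print(string_value)  # Bold normal Italist
```

The words "Bold" and "Italist" belong to the text nodes of `<strong>` and `<i>`, not of `<a>`, which is why a/text() skips them.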

2 Answers

a/text() selects only the text nodes that are direct children of a. Use a//text() instead to select every descendant text node, then join the pieces.

from scrapy.selector import Selector

doc = """
<a href=""><strong>Bold</strong> normal <i>Italist</i></a>
"""

sel = Selector(text=doc, type="html")

# a/text() matches only text nodes that are direct children of <a>
result = sel.xpath('//a/text()').extract()
print(result)
# >>> [' normal ']

# a//text() matches every descendant text node; join them into one string
result = ''.join(sel.xpath('//a//text()').extract())
print(result)
# >>> Bold normal Italist
Frank Martin

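You can take the string value of the element instead of selecting its text nodes. Try

a/string()

(XPath 2.0 syntax; in XPath 1.0, which is what lxml and therefore Scrapy support, write string(a) instead) or

normalize-space(a)

which also trims leading and trailing whitespace and collapses internal runs of whitespace. Both return Bold normal Italist for the HTML above.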

Andersson