Scrapy: How to get a correct selector

Question

I would like to select the following text:

Bold normal Italist

That is, the result should be the full string: Bold normal Italist.

The HTML is:

<a href=""><strong>Bold</strong> normal <i>Italist</i></a>

However, a/text() yields

normal

only. Does anyone know a fix? I'm testing crawling Bing, and the bold text is in a different position depending on the query.

You need to understand [**the difference between text nodes and string values in XPath**](https://stackoverflow.com/a/41077106/290085) — kjhughes, Jun 02 '17 at 16:20
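The distinction in the comment above can be illustrated without Scrapy. The following is a minimal sketch using only the standard library's `xml.etree.ElementTree`, which works here only because the snippet from the question happens to be well-formed XML. The variable names are illustrative, and this is not how Scrapy evaluates XPath internally.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<a href=""><strong>Bold</strong> normal <i>Italist</i></a>')

# Text nodes that are direct children of <a>: any text before the first child
# element (a.text) plus the text that follows each child element (child.tail).
# This corresponds to what a/text() selects.
direct_text_nodes = [t for t in [a.text] + [child.tail for child in a] if t]

# String value of <a>: every descendant text node, concatenated in document
# order. This corresponds to joining the results of a//text().
string_value = "".join(a.itertext())

print(direct_text_nodes)  # [' normal ']
print(string_value)       # Bold normal Italist
```

The words "Bold" and "Italist" belong to the `strong` and `i` elements, not directly to `a`, which is why `a/text()` sees only " normal ".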

score 3 · Accepted Answer · answered Jun 02 '17 at 16:05

You can use a//text() instead of a/text() to get all descendant text nodes, not only the text nodes that are direct children of the a element.

from scrapy.selector import Selector

doc = """
<a href=""><strong>Bold</strong> normal <i>Italist</i></a>
"""

sel = Selector(text=doc, type="html")

# a/text() matches only the text nodes that are direct children of <a>
result = sel.xpath('//a/text()').getall()
print(result)
# >>> [' normal ']

# a//text() matches every descendant text node; join them into one string
result = ''.join(sel.xpath('//a//text()').getall())
print(result)
# >>> Bold normal Italist

score 3 · Answer 2 · answered Jun 02 '17 at 16:06

You can try to use

a/string()

or

normalize-space(a)

which returns Bold normal Italist

answered by Andersson

Scrapy only supports XPath 1.0, so `a/string()` will not work. – paul trmbrth Jun 02 '17 at 16:40
