Extract text with bold content from css selector

Question

I am trying to extract text from forum posts, but the content of the bold element is ignored.

How can I extract the raw text including the bold part, i.e. "Some text to extract bold content?" Currently I am only getting "Some text to extract ?".

<blockquote class="messageText SelectQuoteContainer ugc baseHtml">
Some text to extract <b>bold content</b>?
</blockquote>

def parse_page(self, response):
    for quote in response.css('article'):
        yield {
            'text': quote.css('blockquote::text').extract()
        }

Granitosaurus · Answer 1 · 2017-04-13T10:36:34.107

1

You need a space in your CSS selector:

'blockquote ::text'
           ^

You want the text of every descendant node under blockquote. Without the space, the selector returns only the text nodes that belong directly to the blockquote node itself, so the text inside <b> is skipped.

edited Apr 13 '17 at 10:36

answered Apr 13 '17 at 10:19

Granitosaurus


Will the :not selector stop working with the space? `blockquote:not(.bbCodeBlock) ::text` Apparently yes. – anvd Apr 13 '17 at 10:39
@anvd I just tested it; it should work and does work fine. Tested: `'blockquote:not(.foo) ::text'` – Granitosaurus Apr 13 '17 at 10:42
the markup is a bit more complicated, and it will not work as expected https://jsfiddle.net/dwfmLcaj/ – anvd Apr 13 '17 at 11:34
@anvd This is not javascript. Scrapy converts all css selectors to xpath so the only css selector implementation that matters here is `cssselect` package, see: https://github.com/scrapy/cssselect. – Granitosaurus Apr 13 '17 at 11:38
Thanks for the link, but currently the problem is the CSS. I don't even know how to select the part of the text that doesn't have any element associated with it. The problem is the CSS for now – anvd Apr 13 '17 at 12:56

score 1 · Answer 2 · answered Apr 13 '17 at 11:10


Use the * selector to select the text of all elements nested inside an element.

# Join first, then strip once, so the spaces between text nodes are kept
''.join(quote.css('blockquote *::text').extract()).strip()
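How the pieces are joined affects the whitespace in the result. The sketch below uses a hard-coded list standing in for the text nodes the selector would plausibly return for the question's markup (an assumption, not captured output) and compares three ways of combining them.

```python
# Assumed text nodes for the question's <blockquote>, hard-coded for illustration
nodes = ['\nSome text to extract ', 'bold content', '?\n']

# Stripping every node before joining removes the space before the bold text
stripped_each = ''.join(n.strip() for n in nodes)

# Joining first and stripping once keeps the inner spacing
joined_then_stripped = ''.join(nodes).strip()

# Collapsing all whitespace runs also handles newlines inside the text
normalized = ' '.join(''.join(nodes).split())

print(repr(stripped_each))
print(repr(joined_then_stripped))
print(repr(normalized))
```

Which variant fits best depends on whether line breaks inside the post should be preserved.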


Umair Ayub
