I am trying to extract the text of forum posts, but the content of the bold element is ignored.

How can I extract the full text, i.e. `Some text to extract bold content?` Currently I only get `Some text to extract ?`.

<blockquote class="messageText SelectQuoteContainer ugc baseHtml">
Some text to extract <b>bold content</b>?
</blockquote>

def parse_page(self, response):
    for quote in response.css('article'):
        yield {
            'text': quote.css('blockquote::text').extract()
        }
Granitosaurus
anvd

2 Answers

You need a space in your css selector:

'blockquote ::text'
           ^

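With the space, the selector returns every text node nested anywhere under the blockquote, including the text inside `<b>`. Without the space, it returns only the text nodes that are direct children of the blockquote itself.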

Granitosaurus
  • Will the `:not` selector stop working with the space, e.g. `blockquote:not(.bbCodeBlock) ::text`? Apparently yes. – anvd Apr 13 '17 at 10:39
  • @anvd I just tested it; it should work, and it does. Tested: `'blockquote:not(.foo) ::text'` – Granitosaurus Apr 13 '17 at 10:42
  • the markup is a bit more complicated, and it will not work as expected https://jsfiddle.net/dwfmLcaj/ – anvd Apr 13 '17 at 11:34
  • @anvd This is not javascript. Scrapy converts all css selectors to xpath so the only css selector implementation that matters here is `cssselect` package, see: https://github.com/scrapy/cssselect. – Granitosaurus Apr 13 '17 at 11:38
  • Thanks for the link, but the problem right now is the CSS: I don't even know how to select the part of the text that has no element associated with it. – anvd Apr 13 '17 at 12:56

Use the `*` selector to select the text of all elements nested inside an element, then join the fragments and strip the result:

''.join(quote.css('blockquote *::text').extract()).strip()

Umair Ayub