Extracting paragraph text including other element's content using Scrapy Selector

Question

Using Scrapy 0.24 Selectors, I want to extract a paragraph's text including the content of its child elements (in the following example, the anchor <a>). How can I achieve that?

The Code

>>> from scrapy import Selector
>>> html = """
        <html>
            <head>
                <title>Test</title>
            </head>
            <body>
                <div>
                    <p>Hello, can I get this paragraph content without this <a href="http://google.com">Google link</a>?</p>
                </div>
            </body>
        </html>
        """
>>> sel = Selector(text=html, type="html")
>>> sel.xpath('//p/text()').extract()
[u'Hello, can I get this paragraph content without this ', u'?']

Output

[u'Hello, can I get this paragraph content without this ', u'?']

Expected output

[u'Hello, can I get this paragraph content without this Google link?']

Hm. You could first extract the contents of what's in ` – Aleksander Lidtke, Jan 26 '15 at 23:25

score 0 · Accepted Answer · edited May 23 '17 at 10:25

I would recommend BeautifulSoup. Scrapy is a complete crawling framework, whereas BeautifulSoup (BS) is a dedicated parsing library (see "Difference between BeautifulSoup and Scrapy crawler?").

Doc: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Install: pip install beautifulsoup4

For your case:

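# 'html' is the string you provided in the question
from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")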
res = [p.get_text().strip() for p in soup.find_all('p')]

Result:

[u'Hello, can I get this paragraph content without this Google link?']
