Using Scrapy's XPath selectors, I'm trying to flatten the textual content of a div element which contains either plain text or formatted HTML content. Here are two examples:
<div>
  <div itemprop="content">
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    <br>
    Donec fringilla est eu euismod varius.
  </div>
  <div itemprop="content">
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
    <p>Donec fringilla est eu euismod varius.</p>
    <p class="quote">
      <span>Quote</span>
      <a href="#">Exclude me</a>
      <ul>
        <li>Exclude me</li>
        <li>Exclude me</li>
      </ul>
    </p>
    <blockquote>Cras facilisis suscipit euismod.</blockquote>
  </div>
</div>
Now the goal is to omit the <p class="quote">Quote</p> in the flattened text, as it only serves as a visual cue for the blockquote following it. Due to the nature of the first example, i.e. text as immediate children of the selected div, the solution I've come up with looks as follows:
//div[@itemprop="content"]/descendant-or-self::*[not(self::script)]/text()[normalize-space()]
This accomplishes three things:
- Exclude <script> nodes, as I don't want to include their text in my result.
- Exclude any nodes which don't contain any text.
- Include immediate textual children of my top-level div (via descendant-or-self).
Unfortunately, it seems to me that the latter is causing the <p class="quote">Quote</p> to be included despite additional exclusion filters, such as:
//div[@itemprop="content"]/descendant-or-self::*[not(self::script) and not(@class="quote")]/text()[normalize-space()]
//div[@itemprop="content"]/descendant-or-self::*[not(self::script)]/text()[normalize-space() and not(ancestor::*[@class="quote"])]
Iterating over the <div itemprop="content"> nodes, the expected output is a list like this:
['Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec fringilla est eu euismod varius.',
'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec fringilla est eu euismod varius. Cras facilisis suscipit euismod.']
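For reference, here is a minimal sketch of how the first selector can be checked against the first example. It assumes lxml as a stand-in for Scrapy's selector (Scrapy's selectors are built on lxml); the names `SNIPPET`, `TEXT_XPATH` and `flatten`, as well as the extra `<script>` element, are made up for illustration and are not part of my actual spider.

```python
from lxml import html

# First example only, plus a hypothetical <script> element added
# to show that its text is filtered out.
SNIPPET = """
<div>
  <div itemprop="content">
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    <br>
    Donec fringilla est eu euismod varius.
    <script>var ignored = 1;</script>
  </div>
</div>
"""

# Same selector as above, made relative so it can run per content div.
TEXT_XPATH = (
    'descendant-or-self::*[not(self::script)]'
    '/text()[normalize-space()]'
)


def flatten(div):
    """Join all non-empty text nodes below `div` into a single string."""
    texts = div.xpath(TEXT_XPATH)
    # Collapse the whitespace inside each text node, then join the nodes.
    return ' '.join(' '.join(t.split()) for t in texts)


root = html.fromstring(SNIPPET)
result = [flatten(div) for div in root.xpath('//div[@itemprop="content"]')]
print(result)
```

On the first example this yields the first string of the expected list; it is the second example where the quote paragraph still leaks through.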
Is there a way to solve this issue with a single XPath selector?
but it could be any form of child here.
– oschlueter Jul 05 '16 at 11:50