
Using Scrapy's XPath selectors, I'm trying to flatten the textual content of a div element that contains either plain text or formatted HTML content. Here are two examples:

<div>
    <div itemprop="content">
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
        <br>
        Donec fringilla est eu euismod varius.
    </div>

    <div itemprop="content">
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
        <p>Donec fringilla est eu euismod varius.</p>
        <p class="quote">
            <span>Quote</span>
            <a href="#">Exclude me</a>
            <ul>
                <li>Exclude me</li>
                <li>Exclude me</li>
            </ul>
        </p>
        <blockquote>Cras facilisis suscipit euismod.</blockquote>
    </div>
</div>

The goal is to omit the <p class="quote"> element, including everything nested inside it, from the flattened text, as it only serves as a visual cue for the blockquote following it. Because the first example has text as immediate children of the selected div, the solution I've come up with looks as follows:

//div[@itemprop="content"]/descendant-or-self::*[not(self::script)]/text()[normalize-space()]

This accomplishes three things:

  1. Exclude <script> nodes as I don't want to include their text in my result.
  2. Exclude any nodes which don't contain any text.
  3. Include immediate textual children of my top-level div (via descendant-or-self).
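
As a rough illustration of those three points, here is a minimal sketch that runs the same XPath through lxml directly (Scrapy selectors wrap lxml, so the expression itself is unchanged). The HTML fragment is made up for this example and is not part of the sample input above.

```python
from lxml import html

# Hypothetical fragment: direct text, a paragraph, a script and a
# whitespace-only paragraph inside the content div.
doc = html.fromstring(
    '<div itemprop="content">Direct text<br>'
    '<p>Paragraph text</p>'
    '<script>var ignored = 1;</script>'
    '<p>   </p></div>'
)

xpath = ('//div[@itemprop="content"]/descendant-or-self::*'
         '[not(self::script)]/text()[normalize-space()]')

# Convert lxml's string results to plain str for printing.
texts = [str(t) for t in doc.xpath(xpath)]
print(texts)  # ['Direct text', 'Paragraph text']
```

The script body and the whitespace-only paragraph are dropped, while the text sitting directly inside the div is kept.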

Unfortunately, the latter seems to cause the content of <p class="quote"> to be included despite additional exclusion filters such as:

//div[@itemprop="content"]/descendant-or-self::*[not(self::script) and not(@class="quote")]/text()[normalize-space()]

//div[@itemprop="content"]/descendant-or-self::*[not(self::script)]/text()[normalize-space() and not(ancestor::*[@class="quote"])]

Iterating over the <div itemprop="content"> nodes, the expected output is a list like this:

['Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec fringilla est eu euismod varius.',
'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec fringilla est eu euismod varius. Cras facilisis suscipit euismod.']

Is there a way to solve this issue with a single XPath selector?

oschlueter
  • `//div[@itemprop="content"]/descendant-or-self::*[not(self::script) and not(@class="quote")]/text()[normalize-space()]` worked for me with your sample input. What are you getting? What are you expecting as output? (You can update the question with these answers.) – paul trmbrth Jul 05 '16 at 11:27
  • Hello paul, thanks for pointing that out! My example input didn't represent the problem properly. The nodes I want to exclude have additional children that I need to exclude as well. To give an example, I added an <a> child and a list to the <p class="quote">, but it could be any form of child here. – oschlueter Jul 05 '16 at 11:50

2 Answers


Here's a way using EXSLT's set operations which scrapy supports (through lxml).

You will probably need to adapt the XPath a bit, but the idea is to select all text nodes under a parent element and exclude those text nodes that are also under a particular descendant element of that parent.

Note: I had to change your input a bit because <p> can't contain <ul>, which was causing problems for lxml (which scrapy uses by default under the hood).
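
The auto-closing behaviour behind that note can be checked with a small sketch that uses lxml directly. The fragment below is a cut-down, hypothetical version of the question's markup, and the exact tree may depend on the libxml2 version in use.

```python
from lxml import html

# <ul> is not allowed inside <p>, so the HTML parser is expected to
# close the <p> before the list starts.
fragment = html.fromstring(
    '<div itemprop="content">'
    '<p class="quote"><span>Quote</span>'
    '<ul><li>Exclude me</li></ul></p>'
    '</div>'
)

ul = fragment.xpath('.//ul')[0]
quote = fragment.xpath('.//p[@class="quote"]')[0]

# Parent of the list, and the list items found below the quote paragraph.
print(ul.getparent().tag)
print(quote.xpath('.//li'))
```

On typical lxml builds the list ends up as a sibling of the paragraph rather than a child, so any filter that relies on `class="quote"` being an ancestor never sees the `<li>` text.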

>>> import scrapy
>>> from pprint import pprint
>>> t = r'''<div>
...     <div itemprop="content">
...         Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
...         <br>
...         Donec fringilla est eu euismod varius.
...     </div>
... 
...     <div itemprop="content">
...         <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
...         <p>Donec fringilla est eu euismod varius.</p>
...         <div class="quote">
...             <ul>
...                 <li>Exclude me</li>
...                 <li>Exclude me</li>
...             </ul>
...             <span>Quote</span>
...             <a href="#test">Exclude me</a>
...         </div>
...         <blockquote>Cras facilisis suscipit euismod.</blockquote>
...     </div>
... </div>'''
>>> selector = scrapy.Selector(text=t, type='html')
>>> pprint(selector.xpath('''
...                set:difference(
...                    //div[@itemprop="content"]//text(),
...                    //div[@class="quote"]//text())
...            ''').extract())
['\n'
 '        Lorem ipsum dolor sit amet, consectetur adipiscing elit. \n'
 '        ',
 '\n        Donec fringilla est eu euismod varius.\n    ',
 '\n        ',
 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
 '\n        ',
 'Donec fringilla est eu euismod varius.',
 '\n        ',
 '\n        ',
 'Cras facilisis suscipit euismod.',
 '\n    ']
>>> 
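
To get from these fragments to one string per content div, as the question expects, the same set operation can be applied relative to each div and the whitespace collapsed afterwards. The following is only a sketch: it uses lxml directly, where the `set` prefix has to be mapped to the EXSLT sets namespace by hand, and the shortened sample text, `EXSLT_SETS` and `flatten` are names made up for this example.

```python
from lxml import html

# Shortened, hypothetical version of the sample input.
SAMPLE = '''<div>
  <div itemprop="content">
    Lorem ipsum. <br> Donec fringilla.
  </div>
  <div itemprop="content">
    <p>Lorem ipsum.</p>
    <div class="quote"><span>Quote</span><a href="#">Exclude me</a></div>
    <blockquote>Cras facilisis.</blockquote>
  </div>
</div>'''

# Plain lxml needs the EXSLT namespace spelled out.
EXSLT_SETS = {"set": "http://exslt.org/sets"}


def flatten(div):
    """Join all text under div, minus text under class="quote" elements."""
    parts = div.xpath(
        'set:difference(.//text(), .//*[@class="quote"]//text())',
        namespaces=EXSLT_SETS,
    )
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(" ".join(parts).split())


root = html.fromstring(SAMPLE)
flattened = [flatten(d) for d in root.xpath('//div[@itemprop="content"]')]
print(flattened)
# ['Lorem ipsum. Donec fringilla.', 'Lorem ipsum. Cras facilisis.']
```

With a scrapy `Selector`, the same relative expression should work through `div.xpath(...)` without the `namespaces` argument, as the transcript above uses `set:difference` without registering anything.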
paul trmbrth

To test the element itself for an attribute, write it as self::*[@class="quote"]:

//div[@itemprop="content"]/descendant-or-self::*[not(self::script or self::*[@class="quote"])]/text()[normalize-space()]
splash58
  • Thanks for pointing that out; unfortunately, that doesn't seem to cut it. Here are my observations: `//div[@itemprop="content"]/descendant-or-self::*[@class="quote"]` correctly selects what I want to exclude. `//div[@itemprop="content"]/descendant-or-self::*[not(@class="quote")]`, however, still selects the entire text including the quote. – oschlueter Jul 05 '16 at 09:41
  • See the result of `//div[@itemprop="content"]/descendant::*[not(@class="quote")]`. With `descendant-or-self` you also get the content of `div[@itemprop="content"]` itself, which includes `Quote` – splash58 Jul 05 '16 at 10:08
  • You get the same result if you keep the `descendant-or-self` axis but append `/text()` – splash58 Jul 05 '16 at 10:09
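
The difference discussed in these comments can be made concrete with a short sketch. It is not taken from either answer: it compares an element-level filter with a filter on the text nodes' ancestors, using lxml directly and a made-up fragment in which the quote container is a `<div>` (so the parser leaves the nesting intact).

```python
from lxml import html

doc = html.fromstring(
    '<div itemprop="content">'
    '<p>Keep me</p>'
    '<div class="quote"><span>Quote</span><a href="#">Exclude me</a></div>'
    '<blockquote>Keep me too</blockquote>'
    '</div>'
)

# Filters elements: drops only the element carrying class="quote";
# its children have no such class and are still selected.
element_filter = (
    'descendant-or-self::*[not(self::script) and not(@class="quote")]'
    '/text()[normalize-space()]'
)

# Filters text nodes: drops any text that has a quote or script ancestor.
ancestor_filter = (
    './/text()[normalize-space()]'
    '[not(ancestor::script) and not(ancestor::*[@class="quote"])]'
)

by_element = [str(t) for t in doc.xpath(element_filter)]
by_ancestor = [str(t) for t in doc.xpath(ancestor_filter)]

print(by_element)   # ['Keep me', 'Quote', 'Exclude me', 'Keep me too']
print(by_ancestor)  # ['Keep me', 'Keep me too']
```

The ancestor-based form only helps when the unwanted nodes really are descendants of the `class="quote"` element in the parsed tree, which is not the case when a `<p>` is closed early by the parser.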