Is it possible to make an XPath expression match non-greedily? For example, given this HTML:

<div>
    <p>A</p>
    <p>B</p>
    <h2>Only until this node</h2>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
</div>

I want an XPath expression that selects only the paragraphs containing A and B. The text inside the nearest h2 node changes all the time, so I cannot match on it; I need a non-greedy XPath, if such a thing exists. Is it possible, and if so, how?

Aminah Nuraini

3 Answers

Assuming the text of <h2>Only until this node</h2> is dynamic, you can select the first h2 in each div by position instead of by text, and then take the p siblings that precede it:

//div/h2[1]/preceding-sibling::p

var htmlString = `
<body>
  <div>
    <p>A</p>
    <p>B</p>
    <h2>Only until this node</h2>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
  </div>
  <div>
    <p>A1</p>
    <p>B2</p>
    <p>C3</p>
    <h2>Second Only until this node</h2>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
  </div>
</body>`;

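// Parse the string as XML; this works because the markup above is well-formed.
var doc = new DOMParser().parseFromString(htmlString, 'text/xml');
// ORDERED_NODE_ITERATOR_TYPE returns the matching nodes in document order.
var iterator = doc.evaluate('//div/h2[1]/preceding-sibling::p', doc, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);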
var thisNode = iterator.iterateNext();
while (thisNode) {
  console.log(thisNode.outerHTML);
  thisNode = iterator.iterateNext();
}
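
For readers who want to see the selection logic spelled out, here is a minimal procedural sketch in Python of what "every p before the first h2" means. It uses only the standard library and is not a replacement for the XPath above; the function name `paragraphs_before_first_h2` is made up for illustration, and the sketch assumes the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET
from itertools import takewhile

HTML = """
<div>
    <p>A</p>
    <p>B</p>
    <h2>Only until this node</h2>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
</div>
"""


def paragraphs_before_first_h2(div):
    """Return the text of each <p> child that comes before the first <h2> child."""
    # takewhile stops at the first <h2>, so later <p> siblings are never visited.
    leading = takewhile(lambda child: child.tag != "h2", div)
    return [child.text for child in leading if child.tag == "p"]


div = ET.fromstring(HTML)  # assumption: the fragment parses as well-formed XML
print(paragraphs_before_first_h2(div))
```

One difference to keep in mind: if a div contains no h2 at all, this sketch returns all of its paragraphs, whereas //div/h2[1]/preceding-sibling::p returns nothing for that div.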
ewwink

Try this XPath:

//div/p[following::h2[contains(.,'Only until this node')]]

This returns every p element that comes before the h2 element containing the text "Only until this node". Note that it relies on that heading text being known.

Check out the example below:

from scrapy import Selector

htmldoc="""
<div>
    <p>A</p>
    <p>B</p>
    <p>C</p>
    <p>D</p>
    <h2>Only until this node</h2>
    <p>E</p>
    <p>F</p>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
    <p>I should not even get this</p>
</div>
"""
sel = Selector(text=htmldoc)
for item in sel.xpath("//div/p[following::h2[contains(.,'Only until this node')]]/text()").extract():
    print(item)

What it produces:

A
B
C
D
SIM

You can try the following XPath 1.0 expression:

/div/p[following-sibling::*[self::h2='Only until this node']]

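It selects all p elements that have a following h2 sibling whose string value is exactly "Only until this node". The leading /div matches only when div is the root element of the document; use //div if it is nested deeper.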
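
The predicate above can be read as "keep a p if any later sibling is an h2 with this exact text". Here is a rough Python sketch of that reading, using only the standard library. The helper name `paragraphs_before_heading` is hypothetical, and the sketch assumes the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<div>
    <p>A</p>
    <p>B</p>
    <h2>Only until this node</h2>
    <p>I should not get this</p>
    <h2>Even though this node exists</h2>
</div>
"""


def paragraphs_before_heading(div, heading_text):
    """Return the text of each <p> child that has a later <h2> sibling
    whose string value equals heading_text."""
    children = list(div)
    matches = []
    for index, child in enumerate(children):
        if child.tag != "p":
            continue
        # Mirrors following-sibling::*[self::h2 = heading_text]:
        # the string value of an element is all of its text joined together.
        later_siblings = children[index + 1:]
        if any(
            sibling.tag == "h2" and "".join(sibling.itertext()) == heading_text
            for sibling in later_siblings
        ):
            matches.append(child.text)
    return matches


div = ET.fromstring(SAMPLE)  # assumption: the fragment parses as well-formed XML
print(paragraphs_before_heading(div, "Only until this node"))
print(paragraphs_before_heading(div, "Even though this node exists"))
```

The second call also returns the paragraph after the first h2, which shows how much the result depends on knowing the exact heading text.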

zx485