I'm trying to read specific parts of a webpage using XPath. The page's markup is poorly structured, but I can't change that. It looks like this:
<root>
<div class="textfield">
<div class="header">First item</div>
Here is the text of the <strong>first</strong> item.
<div class="header">Second item</div>
<span>Here is the text of the second item.</span>
<div class="header">Third item</div>
Here is the text of the third item.
</div>
<div class="textfield">
Footer text
</div>
</root>
I want to extract the text of each item, i.e. the text between one header div and the next (e.g. 'Here is the text of the first item.'). So far I've used this XPath expression:
//text()[preceding::*[@class='header' and contains(text(),'First item')] and following::*[@class='header' and contains(text(),'Second item')]]
However, I cannot hardcode the name of the ending item, because the order of the items differs between the pages I want to scrape (e.g. 'First item' may be followed by 'Third item').
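To make the desired result concrete, the grouping can be sketched in Python rather than XPath. This is only an illustration of the expected output, not a solution to the XPath question: it uses the standard library's xml.etree.ElementTree, assumes the markup parses as XML (a real page may need an HTML parser instead), and the names SAMPLE and items_by_header are made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question (hypothetical constant name).
SAMPLE = """<root>
<div class="textfield">
<div class="header">First item</div>
Here is the text of the <strong>first</strong> item.
<div class="header">Second item</div>
<span>Here is the text of the second item.</span>
<div class="header">Third item</div>
Here is the text of the third item.
</div>
<div class="textfield">
Footer text
</div>
</root>"""


def items_by_header(xml_text):
    """Map each header's text to the text between it and the next header."""
    parts = {}
    root = ET.fromstring(xml_text)
    for field in root.findall(".//div[@class='textfield']"):
        current = None  # header whose text is being collected, if any
        for child in field:
            if child.get("class") == "header":
                # A new header starts a new item; its tail is the text
                # that directly follows the header's closing tag.
                current = "".join(child.itertext()).strip()
                parts[current] = [child.tail or ""]
            elif current is not None:
                # Any other element (strong, span, ...) contributes its
                # own text plus whatever follows it.
                parts[current].append("".join(child.itertext()))
                parts[current].append(child.tail or "")
    # Join the fragments and collapse runs of whitespace.
    return {name: " ".join("".join(chunks).split())
            for name, chunks in parts.items()}


if __name__ == "__main__":
    for name, text in items_by_header(SAMPLE).items():
        print(f"{name}: {text}")
```

On the sample, this maps 'First item' to 'Here is the text of the first item.' regardless of which header comes next, and it ignores the footer text because that div contains no header.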
Any help on how to adapt my XPath query would be greatly appreciated.