I'm trying to read specific parts of a webpage through XPath. The page is not very well-formed but I can't change that...

<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>

I want to extract the text of the various items, i.e. the text in between the header divs (e.g. 'Here is the text of the first item.'). I've used this XPath expression so far:

//text()[preceding::*[@class='header' and contains(text(),'First item')] and following::*[@class='header' and contains(text(),'Second item')]]

However, I cannot hardcode the ending item name, because the order of the items differs between the pages I want to scrape (e.g. 'First item' may be followed by 'Third item').

Any help on how to adapt my XPath query would be greatly appreciated.

  • Would /root/text() work? This should return text nodes under the root element so you would have to iterate over it. – Pawel Apr 17 '12 at 05:03
  • Perhaps, but then how would I be able to find out which item they belong to? Another thing: some of the text also contains tags (like <strong> etc.), and with root/text() these are not included... – Michiel Meulendijk Apr 17 '12 at 09:15

3 Answers

//*[@class='header' and contains(text(),'First item')]/following::text()[1] selects the first text node after <div class="header">First item</div>.
//*[@class='header' and contains(text(),'Second item')]/following::text()[1] selects the first text node after <div class="header">Second item</div>, and so on.
EDIT: Sorry, this will not work when the text contains tags such as <strong>. I will update my answer.
EDIT2: Used @Michiel's part. It looks unwieldy, but it works: //div[@class='textfield'][1]//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[not(self::strong) and not(self::span)][1][contains(text(),'First item')]] or not(//*[preceding::*[@class='header' and contains(text(),'First item')]])]
There should be a better solution to this :)
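
To illustrate why taking only the first text node after a header falls short for the <strong> case, here is a minimal sketch in Python. It assumes the standard-library ElementTree, which has no following:: axis, so the header's `.tail` (the text right after its closing tag) stands in for the node that following::text()[1] would pick on this sample. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

XML = """<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>"""

root = ET.fromstring(XML)

# Locate the header divs by their exact text.
first_header = next(d for d in root.iter("div") if d.text == "First item")
second_header = next(d for d in root.iter("div") if d.text == "Second item")

# .tail is the text between the header's closing tag and the next tag.
# For the first item it stops in front of <strong>.
first_text_node = first_header.tail.strip()

# For the second item the tail is whitespace only, because the actual
# text is wrapped in <span>.
second_tail = second_header.tail.strip()

print(repr(first_text_node))
print(repr(second_tail))
```

The first item comes back truncated at the <strong> tag and the second one comes back empty, which is why the later expressions in this thread select every text node between two headers instead.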

Aleh Douhi
  • Thanks, but I can't get it to work. I edited the XML in the first post. Can I use the first instance of class=textfield to limit the contents of the last item? – Michiel Meulendijk Apr 17 '12 at 18:55
  • @MichielMeulendijk - you can insert `//div[@class='textfield'][1]` at the beginning of the xpath, so it will look like `//div[@class='textfield'][1]//text()...`. But I see `<span>` here too, so you should change `[not(self::strong)]` to `[not(self::strong) and not(self::span)]`. Edited initial answer – Aleh Douhi Apr 17 '12 at 19:18
  • The [1] addition doesn't seem to work... It still selects the footer text too. If I can somehow limit the contents to the first div, I wouldn't need your second solution. I'm sure that one works too, but I don't know what tags the contents may hold: span, strong, img, etc. It could be anything, so it would be a lot of work to manually add all possibilities... – Michiel Meulendijk Apr 17 '12 at 19:32
  • @MichielMeulendijk - It's hard to say what the problem is from the given part of the XML. Try this, for example: `//descendant::div[@class='textfield'][1]/div[.='First item']/following::div[1]/preceding::text()[preceding::div[.='First item']]`. Is this the first div with the textfield class in the XML, btw? – Aleh Douhi Apr 17 '12 at 20:05
  • Got it! Apparently [1] doesn't work with //, only with /. When I used position() it did work: //*[ @class='header' and position() = 1 ] //text() [...] Thanks for all your help! – Michiel Meulendijk Apr 17 '12 at 20:38
  • `/following::text()[1]` is exactly what I was looking for – Nakilon Dec 09 '15 at 16:13
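
A note on `[1]` combined with `//`: in XPath 1.0, `//div[1]` is an abbreviation for `/descendant-or-self::node()/child::div[1]`, so the position test is applied per parent and matches every div that is the first div child of its own parent. To get the first match in the whole document, the expression has to be parenthesized, e.g. `(//div[@class='textfield'])[1]`. Below is a minimal sketch of the difference, assuming Python's standard-library ElementTree. It supports only a subset of XPath and has no parenthesized form, so the document-order first match is taken by indexing the result list. `label` is a hypothetical helper added for readability.

```python
import xml.etree.ElementTree as ET

XML = """<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>"""

root = ET.fromstring(XML)


def label(div):
    # Hypothetical helper: describe a div by its class and its leading text.
    return (div.get("class"), (div.text or "").strip())


# .//div[1]: every div that is the first div child of its own parent.
per_parent = [label(d) for d in root.findall(".//div[1]")]

# First div in document order, the counterpart of XPath (//div)[1].
first_overall = label(root.findall(".//div")[0])

print(per_parent)
print(first_overall)
```

On the sample XML the per-parent form matches two elements (the first textfield div and the 'First item' header inside it), while indexing the full result list yields only the first textfield div.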

Found it!

//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[@class='header'][1][contains(text(),'First item')]]]

Indeed your solution, Aleh, won't work for tags inside the text.

Now, the one remaining case is the last item, which is not followed by an element with class=header, so the expression will include all text up to the end of the document. Any ideas?

For the sake of completeness, the final query, composed of various suggestions throughout the thread:

//*[
    @class='textfield' and position() = 1
]
//text() [
    preceding::*[
        @class='header' and contains(text(),'First item')
    ]
][
    following::*[
        preceding::*[
            @class='header'
        ][1][
            contains(text(),'First item')
        ]
    ]
]
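
If the scraping is driven from a script, the same grouping can also be done outside XPath. This is not something proposed in the thread, only a rough sketch under the assumption that the page can be parsed with Python's standard-library ElementTree (real-world HTML may need a more forgiving parser). `items_by_header` is a hypothetical helper: it walks the children of the first textfield div and collects everything between one header and the next, whatever tags the text contains.

```python
import xml.etree.ElementTree as ET

XML = """<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>"""


def items_by_header(textfield):
    """Map each header's text to the text between it and the next header."""
    items = {}
    current = None
    for child in textfield:
        if child.get("class") == "header":
            # A new item starts; the header's tail is its first text chunk.
            current = (child.text or "").strip()
            items[current] = [child.tail or ""]
        elif current is not None:
            # Any other tag (strong, span, ...): keep its text and its tail.
            items[current].append("".join(child.itertext()))
            items[current].append(child.tail or "")
    # Join the chunks and collapse runs of whitespace.
    return {k: " ".join("".join(v).split()) for k, v in items.items()}


root = ET.fromstring(XML)
# find() returns the first matching child, so the footer div is never visited.
first_textfield = root.find("div[@class='textfield']")
result = items_by_header(first_textfield)
print(result)
```

Because the loop only visits children of the first textfield div, the last item ends where that div ends, and the order of the items does not matter.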