I am trying to write a small application to extract content from Wikipedia pages. When I first thought of it, I assumed I could just target the divs containing content with XPath, but after looking into how Wikipedia builds its articles, I quickly discovered that it wouldn't be so easy. The best way to separate the content once I have the page is to select what lies between two h2 tags.

Example: <h2>Title</h2> <div>Some Content</div> <h2>Title</h2>

Here I would want to get the div between the two headers. I tried doing this with XPath, but with no luck at all. I am going to look further into XPath because I think it's what I need to achieve what I want, but before I go too deep, I would like to hear what you guys think. Is XPath the right way to go, or do I have other, easier options? I am writing the application in C#, if that makes any difference.

kjhughes
Severin

2 Answers

Yes, you're on the right track with XPath -- it's ideal for selecting parts of an XML document.

For example, for this XML,

<r>
   <h2>Title A</h2>
   <div>Some Content</div>
   <div>More Content</div>
   <h2>Title B</h2>
</r>

this XPath,

//div[preceding-sibling::h2 = 'Title A' and following-sibling::h2 = 'Title B']

will select this content,

<div>Some Content</div>
<div>More Content</div>

between the two h2 titles, as requested.
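
To check an expression like this outside an online tester, a minimal sketch along the following lines may help. It uses the JDK's built-in `javax.xml.xpath`; the class name `BetweenHeadings` and the helper `select` are illustrative names, while the XML and the XPath are the ones shown above. The question mentions C#; since this is plain XPath 1.0, the same expression string should carry over to .NET's XPath support (for example `XmlNode.SelectNodes`), though that is not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BetweenHeadings {
    // Sample document from the answer.
    static final String XML =
        "<r>"
        + "<h2>Title A</h2>"
        + "<div>Some Content</div>"
        + "<div>More Content</div>"
        + "<h2>Title B</h2>"
        + "</r>";

    // The XPath from the answer, unchanged.
    static final String XPATH =
        "//div[preceding-sibling::h2 = 'Title A' and following-sibling::h2 = 'Title B']";

    // Evaluate an XPath 1.0 expression and return the text content of each selected node.
    static List<String> select(String xml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
            .newXPath()
            .evaluate(xpath, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Prints the text of the two divs between the headings.
        System.out.println(select(XML, XPATH));
    }
}
```

Note that the predicate only requires *some* preceding `h2` equal to 'Title A' and *some* following `h2` equal to 'Title B', so it relies on the two heading texts being distinct among the siblings.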


Update to address OP's self-answer:

For this new XML example,

<div>
    <h2><span>Summary</span></h2>
    <p>Paragraph</p>
    <ul>
        <li>List1</li>
        <li>List2</li>
        <li>List3</li>
    </ul>
    <p>Paragraph</p>

    <h2><span>Location</span></h2>
    <p>Paragraph</p>
</div>

the XPath I provided above can easily be adapted,

//*[preceding-sibling::h2 = 'Summary' and following-sibling::h2 = 'Location']

to select this XML,

<p>Paragraph</p>  
<ul>
   <li>List1</li>
   <li>List2</li>
   <li>List3</li>
</ul>    
<p>Paragraph</p>

as requested.
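
The reason `h2 = 'Summary'` still matches when the text sits inside a `span` is that `=` compares the element's string value (all descendant text concatenated), whereas `text() = 'Summary'` only looks at text nodes that are direct children. The sketch below, again using the JDK's `javax.xml.xpath`, contrasts the two on the sample above; the class name `StringValueVsText`, the constant names, and the helper `names` are illustrative, and the third expression is one possible way to test the `span`'s text node directly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class StringValueVsText {
    // Sample document from the updated answer (whitespace between elements omitted).
    static final String XML =
        "<div>"
        + "<h2><span>Summary</span></h2>"
        + "<p>Paragraph</p>"
        + "<ul><li>List1</li><li>List2</li><li>List3</li></ul>"
        + "<p>Paragraph</p>"
        + "<h2><span>Location</span></h2>"
        + "<p>Paragraph</p>"
        + "</div>";

    // Compares the string value of h2, which includes the text inside the span.
    static final String BY_STRING_VALUE =
        "//*[preceding-sibling::h2 = 'Summary' and following-sibling::h2 = 'Location']";

    // Compares only text nodes that are direct children of h2; there are none here.
    static final String BY_H2_TEXT_NODE =
        "//*[preceding-sibling::h2[text() = 'Summary'] and following-sibling::h2[text() = 'Location']]";

    // Steps into the span child and compares its text node.
    static final String BY_SPAN_TEXT_NODE =
        "//*[preceding-sibling::h2/span[text() = 'Summary'] and following-sibling::h2/span[text() = 'Location']]";

    // Evaluate an XPath 1.0 expression and return the element names of the selected nodes.
    static List<String> names(String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
            .newXPath()
            .evaluate(xpath, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getNodeName());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("string value:   " + names(BY_STRING_VALUE));
        System.out.println("h2 text node:   " + names(BY_H2_TEXT_NODE));
        System.out.println("span text node: " + names(BY_SPAN_TEXT_NODE));
    }
}
```

On this sample the first and third expressions select the `p`, `ul`, `p` run between the headings, while the second selects nothing because neither `h2` has a text node of its own.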

kjhughes
  • That was exactly what I was looking for! Thank you :-) I will mark as correct answer after I test it when I get home. – Severin Aug 22 '16 at 13:31
  • I added my own answer the correct answer. Your answer did guide me in the right direction though! – Severin Aug 22 '16 at 15:03
  • You're welcome. Please [**accept**](http://meta.stackoverflow.com/q/5234/234215) this answer if it's helped. Thanks. (Not sure what you mean by *I added my own answer the correct answer*, unless you mean you had to make adjustments -- I don't see another SO answer posted here by you to this question.) – kjhughes Aug 22 '16 at 15:09
  • I just posted it now. Got distracted by a phonecall ^^ – Severin Aug 22 '16 at 15:14
  • I'm glad you got it working on your own, however you might want to review my updated answer that works with your new sample; it's simpler and more robust than what you've posted in your answer. You'll want to understand the [difference between testing text nodes and string values](http://stackoverflow.com/a/34595441/290085), for example. – kjhughes Aug 22 '16 at 15:21
With help from kjhughes's suggestion, I managed to get the code working.

I was unable to make the = 'Text' part work, so I replaced it with [text() = 'Text'].

That alone wasn't enough, as the title of the content I need is located inside a span in an h2 tag, so I had to adapt the XPath a bit more.

This is what I came up with:

//*[preceding-sibling::h2/span[text() = 'Summary'] and following-sibling::h2/span[text() = 'Location']]

I tested it using http://www.xpathtester.com/xpath on this HTML:

<div>
    <h2><span>Summary</span></h2>
    <p>Paragraph</p>
    <ul>
        <li>List1</li>
        <li>List2</li>
        <li>List3</li>
    </ul>
    <p>Paragraph</p>

    <h2><span>Location</span></h2>
    <p>Paragraph</p>
</div>

Which gave me the following result:

<p>Paragraph</p>
<ul>
    <li>List1</li>
    <li>List2</li>
    <li>List3</li>
</ul>
<p>Paragraph</p>
Severin