I am trying to write a small application to extract content from Wikipedia pages. When I first thought if it, I thought that I could just target divs containing content with XPath, but after looking into how Wikipedia builds their articles, I quickly discovered that wouldn't be so easy. The best way to separate content when I get the page, is to select what's between two sets of h2
tags.
Example:
<h2>Title</h2> <div>Some Content</div> <h2>Title</h2>
Here I would want to get the div
between the sets of headers. I tried doing this with XPath, but with no luck at all. I am going to look more into XPath because I think that's what I need to use to achieve what I want, but before I look too much into it, I would like to hear what you guys think about it. Is XPath the right way to go or do I have other easier options? I write the application in C# if that makes any difference.