I have a Wikipedia article and I want to fetch the first z lines (or the first x characters, or the first y words; it doesn't matter) from the article.
The problem: I can get either the source wikitext (via the API) or the parsed HTML (via a direct HTTP request, possibly for the print version), but how can I find the first lines that are actually displayed? Normally, the source (both HTML and wikitext) starts with the infoboxes and images, and the first real text to display is somewhere further down in the code.
For example:
Albert Einstein on Wikipedia (print version). Look at the source. The first line of real text, "Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n]; 14 March 1879–18 April 1955) was a theoretical physicist.", is not at the start. The same applies to the wikitext source; it starts with the same infobox and so on.
So how would you accomplish this task? The programming language is Java, but this shouldn't matter.
One solution that came to mind was to use an XPath query, but such a query would be rather complicated if it had to handle all the edge cases.
It wasn't that complicated; see my solution below!
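For illustration only (this is not necessarily the solution referred to above), here is a minimal sketch of the XPath idea in Java. It rests on several assumptions: the page is available as well-formed XHTML (real Wikipedia HTML may first need an HTML-tolerant parser or a tidy step), the article text sits in p elements that are direct children of a container with the id "bodyContent", and infoboxes are tables rather than paragraphs. The class name, the method names and the sample markup are all made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class FirstParagraph {

    // Assumption: the id "bodyContent" and the "direct child <p>" layout are
    // guesses about the page structure; adjust them to the markup you get.
    // [normalize-space()] skips paragraphs that contain only whitespace.
    static final String QUERY = "//*[@id='bodyContent']/p[normalize-space()][1]";

    // Returns the text of the first non-empty paragraph, or "" if none matches.
    static String firstParagraph(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // evaluate(String, Object) returns the string value of the first
            // matching node, i.e. its text with nested tags flattened.
            return XPathFactory.newInstance().newXPath().evaluate(QUERY, doc).trim();
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    // Returns at most maxWords whitespace-separated words from the text.
    static String firstWords(String text, int maxWords) {
        String[] words = text.trim().split("\\s+");
        int n = Math.min(maxWords, words.length);
        return String.join(" ", Arrays.copyOfRange(words, 0, n));
    }

    public static void main(String[] args) {
        // Hypothetical, heavily simplified stand-in for a downloaded page.
        String page = "<html><body><div id=\"bodyContent\">"
                + "<table class=\"infobox\"><tr><td>Born: 14 March 1879</td></tr></table>"
                + "<p> </p>"
                + "<p>Albert Einstein was a <b>theoretical physicist</b>.</p>"
                + "<p>Second paragraph.</p>"
                + "</div></body></html>";
        String text = firstParagraph(page);
        System.out.println(text);                // Albert Einstein was a theoretical physicist.
        System.out.println(firstWords(text, 4)); // Albert Einstein was a
    }
}
```

Once the first paragraph is isolated, cutting it down to the first x characters or y words is plain string handling, as firstWords shows. The part most likely to need tuning is the XPath expression itself, since it depends on how the page is actually structured.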