I have identified a section of a web page as my area of interest. It may contain multiple HTML tags, but I want to interpret it as multiline text, as close as possible to how the browser renders it.
Let me give you an example.
<div>
<p>Line 1<p>
</div>
<div><p>Line 2<p></div> <div><p>Line 3 <p></div>
<p>Line 4<p></div><br />Line 5
In the browser, it is rendered like this:
Line 1
Line 2
Line 3
Line 4
Line 5
I want to run the original HTML through a library and get text with the following contents (or close):
Line 1
Line 2
Line 3
Line 4
Line 5
Please note, I don't want to recover the original line breaks present in the HTML source (as this question points out). I want the HTML tags to be interpreted as line breaks, similar to the way the browser renders them. Is there any library that can do this? I've used Jsoup's TextNode.getWholeText(), but it doesn't take the HTML tags into account.
Edit: for Linux users out there, I want something similar to the result of:
$ lynx -dump file.html > file.txt
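To make the desired mapping concrete, here is a rough sketch in plain Java of the behaviour I am after. It is only a regex-based approximation and not a real HTML parser: the class name HtmlLines, the method toText and the list of "line-breaking" tags are my own assumptions for illustration, and it does not decode entities such as &amp;amp; or handle scripts, styles or comments. A proper library would presumably do this on a parsed DOM instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class HtmlLines {
    // Tags assumed to force a line break when rendered.
    // This list is illustrative and deliberately incomplete.
    private static final Pattern BREAKING_TAG = Pattern.compile(
        "(?i)</?(?:p|div|br|li|ul|ol|tr|table|h[1-6]|blockquote|pre)\\b[^>]*>");

    // Any remaining tag (inline elements such as <b> or <span>) is simply dropped.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String toText(String html) {
        // 1. Turn every opening or closing line-breaking tag into a newline.
        String withBreaks = BREAKING_TAG.matcher(html).replaceAll("\n");
        // 2. Remove all other tags without adding whitespace.
        String noTags = ANY_TAG.matcher(withBreaks).replaceAll("");
        // 3. Collapse whitespace inside each line and skip empty lines,
        //    so consecutive breaks produce a single line break.
        List<String> lines = new ArrayList<>();
        for (String raw : noTags.split("\n")) {
            String line = raw.replaceAll("\\s+", " ").trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        // The example markup from the question, including its unclosed <p> tags.
        String html = "<div>\n<p>Line 1<p>\n</div>\n"
            + "<div><p>Line 2<p></div> <div><p>Line 3 <p></div>\n"
            + "<p>Line 4<p></div><br />Line 5";
        System.out.println(toText(html));
    }
}
```

Because empty lines are skipped, this sketch also discards the vertical spacing a browser would put between paragraphs; that is fine for my purposes, since I only care about getting one line of text per rendered line.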