How to get a page's line breaks based on the visual representation of HTML elements (or at least, close to it)?

Question

I have identified a section of a webpage as my area of interest. It may contain multiple HTML tags, but I want to interpret it as multiline text, as close as possible to how the browser renders it.

Let me give you an example.

<div>
<p>Line 1<p>
</div>
<div><p>Line 2<p></div> <div><p>Line 3 <p></div>
<p>Line 4<p></div><br />Line 5

In the browser, it is rendered like this:

Line 1

Line 2

Line 3

Line 4

Line 5

I want to run the original html through some sort of lib and get a text with the following contents (or close):

Line 1
Line 2
Line 3
Line 4
Line 5

Please note that I don't want to recover the original line breaks present in the HTML source (as this question points out). I want to interpret the HTML elements as line breaks, similar to the way the browser renders them. Is there any library that can do this? I've tried Jsoup's TextNode.getWholeText(), but it doesn't parse the HTML tags.

Edit: for Linux users out there, I want something similar to the result of:

$ lynx -dump file.html > file.txt
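
For illustration, here is a rough sketch of the kind of conversion described above, using only the Java standard library. It is a heuristic, not a real HTML renderer or parser: the class name HtmlLines, the method toText, and the list of tags treated as line breaks are all assumptions made for this example. It does not decode character references such as &amp;amp; and it will be fooled by markup inside comments or scripts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class HtmlLines {
    // Assumed (incomplete) list of tags that start a new line when rendered.
    // Matches both the opening and the closing form, e.g. <p>, </div>, <br />.
    private static final Pattern BREAKING_TAG = Pattern.compile(
        "(?i)</?(?:p|div|br|li|ul|ol|tr|table|h[1-6]|blockquote|pre|hr)\\b[^>]*>");

    // Any remaining tag is treated as inline and simply dropped.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    public static String toText(String html) {
        // 1. Turn every line-breaking tag into a newline.
        String withBreaks = BREAKING_TAG.matcher(html).replaceAll("\n");
        // 2. Strip all other tags.
        String noTags = ANY_TAG.matcher(withBreaks).replaceAll("");
        // 3. Collapse whitespace inside each line and skip empty lines.
        List<String> lines = new ArrayList<>();
        for (String raw : noTags.split("\n")) {
            String line = raw.replaceAll("\\s+", " ").trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        // The example markup from the question, malformed tags included.
        String html = "<div>\n<p>Line 1<p>\n</div>\n"
            + "<div><p>Line 2<p></div> <div><p>Line 3 <p></div>\n"
            + "<p>Line 4<p></div><br />Line 5";
        System.out.println(toText(html));
    }
}
```

Run on the example markup, this prints Line 1 through Line 5 on separate lines with no blank lines in between. Because it ignores CSS and the actual document structure, it only approximates what a browser or lynx -dump would produce.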

score 0 · Answer 1 · answered Jun 27 '12 at 15:49

By default, browsers render <div> and <p> as block-level elements, so each one starts on its own line, and <p> also gets a default vertical margin. That is why the browser renders the text with blank space between the lines.

Create a CSS file that sets the margin and padding of those elements to zero.

Also, why is Java tagged? If you're doing this in a Java Servlet Page check your System.out.println statements.

answered Jun 27 '12 at 15:49

Chad


I don't think you understood my question correctly, sir. There are two tags in it. This question's main focus is on web scraping with Java, not web design. – Cacovsky Jun 27 '12 at 16:16
