JSoup Element wholeText removing spacing

Question

I am using the following code to parse HTML with JSoup:

Jsoup.parse(html).wholeText()

My HTML samples will include text like the following:

<p>some text</p><br /><br>later

However, the output from JSoup is always:

some textlater

My desired output is something like this:

some text

later

(Note the line breaks after 'some text' and 'later'.)

Is there a different method I should use instead of wholeText if I want to retain spacing? I did find the following Stack Overflow question, which is similar:
How do I preserve line breaks when using jsoup to convert html to plain text?

However, the answers to that question all rely on string replacements or regular expressions that look for br or other specific tags. I am looking for something more general purpose, such as an HTML parser that removes HTML tags while retaining line breaks and other whitespace. It doesn't have to be jsoup if there is a better Java library.

aisha · Answer 1 · 2019-06-22T16:50:36.883

0

You can use:

Document doc = Jsoup.parse(html);

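This parses your string into a tree of HTML nodes that you can manipulate. Then use

doc.outputSettings().indentAmount(0).prettyPrint(false);

to turn off pretty-printing, so jsoup does not re-indent the markup or normalize its whitespace when writing it back out.

Finally, to get the HTML back as a string (html() already returns a String, so no toString() call is needed):

doc.body().html();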
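
Put together, the steps above might look like the following minimal sketch. It assumes jsoup is on the classpath; the class name KeepWhitespaceExample and the helper name bodyHtmlWithoutPrettyPrint are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class KeepWhitespaceExample {

    // Hypothetical helper: it only chains the three calls from the answer.
    static String bodyHtmlWithoutPrettyPrint(String html) {
        // Parse the input string into a Document (a tree of nodes).
        Document doc = Jsoup.parse(html);

        // Disable pretty-printing so the serializer does not re-indent
        // the markup or normalize whitespace inside text nodes.
        doc.outputSettings().indentAmount(0).prettyPrint(false);

        // Serialize the children of <body> back to an HTML string.
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Sample input taken from the question.
        String html = "<p>some text</p><br /><br>later";
        System.out.println(bodyHtmlWithoutPrettyPrint(html));
    }
}
```

Note that html() returns markup, so the result still contains the p and br tags. This approach keeps the original whitespace of the HTML, but on its own it does not produce tag-free plain text.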

edited Jun 22 '19 at 16:50

answered Jun 22 '19 at 16:43
