
I have the following code:

 import org.jsoup.Jsoup;

 public class NewClass {
     public String noTags(String str){
         return Jsoup.parse(str).text();
     }


     public static void main(String args[]) {
         String strings="<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
         "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";

         NewClass text = new NewClass();
         System.out.println((text.noTags(strings)));
     }
 }

And I have the result:

hello world yo googlez

But I want to break the line:

hello world
yo googlez

I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.

If there's a <br> in the markup I parse, how can I get a line break in my resulting output?

randers
Billy
  • edit your text - there is no line break showing up in your question. In general please read the preview of your question before posting it, to check everything is showing up right. – Robin Green Apr 12 '11 at 19:15
  • I asked the same question (without the jsoup requirement) but I still do not have a good solution: http://stackoverflow.com/questions/2513707/how-to-convert-html-to-text-keeping-linebreaks – Eduardo Jul 19 '11 at 21:56
  • See @zeenosaur's answer. – Jang-Ho Bae Sep 16 '19 at 13:40

15 Answers


The real solution that preserves linebreaks should be like this:

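import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // newer jsoup releases call this class Safelist

public static String br2nl(String html) {
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);
    // Disable pretty-printing so html() preserves the original line breaks and spacing
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));
    // Mark each <br> and <p> with a literal backslash-n placeholder
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    // Turn the placeholders into real newlines, then strip every remaining tag
    String s = document.html().replaceAll("\\\\n", "\n");
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}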

It satisfies the following requirements:

  1. if the original html contains newline(\n), it gets preserved
  2. if the original html contains br or p tags, they get translated to newline(\n).
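
To make the two requirements concrete without a jsoup dependency, here is a minimal sketch of the same placeholder idea using only `java.util.regex`. This is a swapped-in technique, not the jsoup code above: the class name `PlaceholderLineBreaks`, the method `toText`, and the marker character are all hypothetical, and the regexes only cope with simple, well-formed markup. For real HTML, a parser such as jsoup remains the safer choice.

```java
import java.util.regex.Pattern;

public class PlaceholderLineBreaks {
    // Hypothetical marker: a private-use character assumed not to occur in the input.
    private static final String MARKER = "\uE000";

    // <br>, <br/>, <br /> in any letter case.
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");
    // Opening <p> tags with optional attributes; does not match <pre>.
    private static final Pattern P_OPEN = Pattern.compile("(?i)<p(\\s[^>]*)?>");
    // Any remaining tag (naive: assumes no '>' inside attribute values).
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    public static String toText(String html) {
        if (html == null) {
            return null;
        }
        // Step 1: mark the places where a line break is wanted.
        String marked = BR.matcher(html).replaceAll(MARKER);
        marked = P_OPEN.matcher(marked).replaceAll(MARKER + MARKER);
        // Step 2: strip all tags; newlines already present in the input are untouched.
        String stripped = ANY_TAG.matcher(marked).replaceAll("");
        // Step 3: turn each marker into a real newline.
        return stripped.replace(MARKER, "\n");
    }

    public static void main(String[] args) {
        String html = "<p><b>hello world</b></p><p><br><b>yo</b> "
                + "<a href=\"http://google.com\">googlez</a></p>";
        System.out.println(toText(html));
    }
}
```

As in the jsoup version, every `<p>` contributes two newlines before its content and every `<br>` contributes one, so the output starts with blank lines that the caller may want to trim. Unlike jsoup, this sketch keeps the text inside `<style>` and `<script>` elements and does not decode HTML entities.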
user207421
user121196
  • the answer by @MircoAttocchi works best for me. this solution leaves entities as such...that's not good! i.e. "La porta &egrave; aperta" remains unchanged, whereas I want "La porta è aperta". – Vito Meuli Jan 09 '14 at 17:26
  • br2nl is not the most helpful or accurate method name – DD. Sep 17 '14 at 22:22
  • This is the best answer. But how about `for (Element e : document.select("br")) e.after(new TextNode("\n", ""));`, which appends a real newline rather than the sequence \n? See [Node::after()](http://jsoup.org/apidocs/org/jsoup/nodes/Node.html#after%28org.jsoup.nodes.Node%29) and [Elements::append()](http://jsoup.org/apidocs/org/jsoup/select/Elements.html#append%28java.lang.String%29) for the difference. The `replaceAll()` is not needed in this case. The same applies to p and other block elements. – user2043553 Oct 01 '14 at 08:05
  • @user121196's answer should be the chosen answer. If you still have HTML entities after you clean the input HTML, apply StringEscapeUtils.unescapeHtml(...) from Apache Commons to the output of the Jsoup clean. – karth500 May 06 '15 at 01:13
  • See https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java for a comprehensive answer to this problem. – Malcolm Smith May 19 '17 at 08:29
  • Two block elements containing `Line one` and `Line 2` should NOT become `\nLine one Line 2`: newlines have to be inserted before AND after the relevant block tags. It is also missing MANY block tags, such as `<li>`. – user3338098 May 29 '19 at 00:01