How do I preserve line breaks when using jsoup to convert html to plain text?

Question

I have the following code:

 import org.jsoup.Jsoup;

 public class NewClass {
     public String noTags(String str) {
         return Jsoup.parse(str).text();
     }

     public static void main(String[] args) {
         String strings = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
             "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";

         NewClass text = new NewClass();
         System.out.println(text.noTags(strings));
     }
 }

And I have the result:

hello world yo googlez

But I want to break the line:

hello world
yo googlez

I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.

If there's a <br> in the markup I parse, how can I get a line break in my resulting output?

edit your text - there is no line break showing up in your question. In general please read the preview of your question before posting it, to check everything is showing up right. — Robin Green, Apr 12 '11 at 19:15
I asked the same question (without the jsoup requirement) but I still do not have a good solution: http://stackoverflow.com/questions/2513707/how-to-convert-html-to-text-keeping-linebreaks — Eduardo, Jul 19 '11 at 21:56

score 112 · Answer 1 · edited Mar 11 '16 at 23:26

The real solution that preserves linebreaks should be like this:

public static String br2nl(String html) {
    if(html==null)
        return html;
    Document document = Jsoup.parse(html);
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    String s = document.html().replaceAll("\\\\n", "\n");
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}

It satisfies the following requirements:

if the original HTML contains a newline (\n), it is preserved
if the original HTML contains br or p tags, they get translated to a newline (\n).
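
The two levels of escaping in `append("\\n")` and `replaceAll("\\\\n", "\n")` are easy to misread. As a minimal sketch using only the JDK (no jsoup involved; the class and method names are made up for illustration), this shows what each literal stands for: the marker inserted into the document is a backslash followed by `n`, and the regex later turns that pair into a real newline.

```java
// Illustrative only: class and method names are hypothetical, not part of jsoup.
final class EscapedMarkerDemo {

    // In Java source, "\\n" is a two-character string: a backslash followed by 'n'.
    static final String MARKER = "\\n";

    // The regex "\\\\n" matches that literal backslash + 'n' pair;
    // the replacement "\n" is a single real newline character.
    static String restoreNewlines(String text) {
        return text.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        System.out.println(MARKER.length()); // 2
        // Prints "hello world" and "yo googlez" on separate lines.
        System.out.println(restoreNewlines("hello world" + MARKER + "yo googlez"));
    }
}
```

One consequence worth keeping in mind: any backslash-n sequence that already appears as visible text in the page (for example inside a code sample) would be converted as well, since the marker cannot be told apart from it.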

edited Mar 11 '16 at 23:26

user207421

answered Oct 26 '13 at 02:57

user121196

the answer by @MircoAttocchi works best for me. this solution leaves entities as such...that's not good! i.e. "La porta &egrave; aperta" remains unchanged, whereas I want "La porta è aperta". – Vito Meuli Jan 09 '14 at 17:26
br2nl is not the most helpful or accurate method name – DD. Sep 17 '14 at 22:22
This is the best answer. But how about `for (Element e : document.select("br")) e.after(new TextNode("\n", ""));` appending a real newline and not the sequence \n? See [Node::after()](http://jsoup.org/apidocs/org/jsoup/nodes/Node.html#after%28org.jsoup.nodes.Node%29) and [Elements::append()](http://jsoup.org/apidocs/org/jsoup/select/Elements.html#append%28java.lang.String%29) for the difference. The `replaceAll()` is not needed in this case. Similar for p and other block elements. – user2043553 Oct 01 '14 at 08:05
@user121196's answer should be the chosen answer. If you still have HTML entities after you clean the input HTML, apply Apache Commons' StringEscapeUtils.unescapeHtml(...) to the output from the Jsoup clean. – karth500 May 06 '15 at 01:13
See https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java for a comprehensive answer to this problem. – Malcolm Smith May 19 '17 at 08:29
`
Line one
Line 2` should NOT be `\nLine one Line 2` newlines have to be inserted before AND after the relevant block tags. and it's missing MANY block tags such as `
` and `
`.

– user3338098 May 29 '19 at 00:01

score 46 · Answer 2 · edited Dec 10 '16 at 07:31

With

Jsoup.parse("A\nB").text();

you have output

"A B"

and not

A

B

For this I'm using:

descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");
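
To see what the `(?i)<br[^>]*>` pattern does and does not match, here is a small JDK-only sketch (no jsoup; the class and method names are hypothetical). It assumes the same pattern and the same `br2n` placeholder as above, and also shows the kind of input where a regex over raw HTML can go wrong: a `>` inside an attribute value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: class and method names are hypothetical.
final class BrPlaceholderDemo {

    // Same pattern as in the answer: "<br", then anything up to the first '>'.
    private static final Pattern BR = Pattern.compile("(?i)<br[^>]*>");

    // Replace every match with the placeholder, treated as a literal string.
    static String markBreaks(String html, String placeholder) {
        return BR.matcher(html).replaceAll(Matcher.quoteReplacement(placeholder));
    }

    public static void main(String[] args) {
        // Common spellings of the tag are all matched: prints "abr2nbbr2ncbr2nd".
        System.out.println(markBreaks("a<br>b<BR/>c<br />d", "br2n"));
        // A '>' inside an attribute value ends the match early: prints "xbr2nb'>y".
        System.out.println(markBreaks("x<br title='a>b'>y", "br2n"));
    }
}
```

For typical markup the pattern is good enough, but the second call shows why working on the parsed DOM is the more robust route when the input is not under your control.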

edited Dec 10 '16 at 07:31

Dmitry Stolbov

answered May 17 '11 at 13:26

Mirco Attocchi

Indeed this is an easy palliative, but IMHO this should be fully handled by the Jsoup library itself (which has at this time a few disturbing behaviors like this one - otherwise it's a great library !). – SRG May 14 '12 at 09:52
Doesn't JSoup give you a DOM? Why not just replace all `<br>` elements with text nodes containing new lines and then call `.text()` instead of doing a regex transform that will cause incorrect output for some strings like `'not an attribute'>` – Mike Samuel Apr 23 '13 at 17:00
Nice, but where does that "descrizione" come from? – Steve Waters Apr 01 '15 at 08:20
"descrizione" represents the variable the plain text gets assigned to – enigma969 Jun 13 '18 at 12:32

Paulius Z · Answer 3 · 2014-08-19T11:26:20.160

45

Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

We're using this method here:

public static String clean(String bodyHtml,
                       String baseUri,
                       Whitelist whitelist,
                       Document.OutputSettings outputSettings)

By passing it Whitelist.none() we make sure that all HTML is removed.

By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.

edited Aug 19 '14 at 11:26

answered Apr 23 '13 at 16:46

Paulius Z

This should be the only correct answer. All others assume that only `br` tags produce new lines. What about any other block element in HTML such as `div`, `p`, `ul` etc? All of them introduce new lines too. – adarshr Sep 18 '14 at 21:25
With this solution, the html "
line 1
line 2
line 3
" produced the output: "line 1line 2line 3" with no new lines. – JohnC Dec 07 '15 at 03:37
This doesn't work for me; `<br>`'s aren't creating line breaks. – Grumblesaurus Dec 01 '17 at 06:26
Thanks! Also, `Whitelist` has been replaced with `Safelist` class. – kolobok Jun 21 '23 at 17:57

zeenosaur · Answer 4 · 2022-01-10T04:06:40.647

33

As of jsoup v1.11.2, we can use Element.wholeText().

String cleanString = Jsoup.parse(htmlString).wholeText();

user121196's answer still works. But wholeText() preserves the alignment of texts.

edited Jan 10 '22 at 04:06

answered May 17 '18 at 14:04

zeenosaur

today, in 2023 is working :-) thanks – DoctorWho May 18 '23 at 09:05

score 24 · Answer 5 · answered Jun 24 '13 at 15:42

Try this by using jsoup:

public static String cleanPreserveLineBreaks(String bodyHtml) {

    // get pretty printed html with preserved br and p tags
    String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
    // get plain text with preserved line breaks by disabled prettyPrint
    return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}

answered Jun 24 '13 at 15:42

mkowa

nice, it works for me with a small change: `new Document.OutputSettings().prettyPrint(true)` – Ashu May 29 '18 at 02:03
This solution leaves "&nbsp;" as text instead of parsing them into a space. – Andrei Volgin Jul 22 '19 at 16:19

score 11 · Answer 6 · answered Sep 21 '17 at 12:49

For more complex HTML none of the above solutions worked quite right; I was able to successfully do the conversion while preserving line breaks with:

Document document = Jsoup.parse(myHtml);
String text = new HtmlToPlainText().getPlainText(document);

(version 1.10.3)

answered Sep 21 '17 at 12:49

Andy Res

Yes this does a good job. – Mustafa Apr 27 '22 at 04:21

popcorny · Answer 7 · 2013-08-01T11:31:48.913

You can traverse a given element

public String convertNodeToText(Element element)
{
    final StringBuilder buffer = new StringBuilder();

    new NodeTraversor(new NodeVisitor() {
        boolean isNewline = true;

        @Override
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                String text = textNode.text().replace('\u00A0', ' ').trim();                    
                if(!text.isEmpty())
                {                        
                    buffer.append(text);
                    isNewline = false;
                }
            } else if (node instanceof Element) {
                Element element = (Element) node;
                if (!isNewline)
                {
                    if((element.isBlock() || element.tagName().equals("br")))
                    {
                        buffer.append("\n");
                        isNewline = true;
                    }
                }
            }                
        }

        @Override
        public void tail(Node node, int depth) {                
        }                        
    }).traverse(element);        

    return buffer.toString();               
}

And for your code

String result = convertNodeToText(Jsoup.parse(html));

I think you should test if `isBlock` in `tail(node, depth)` instead, and append `\n` when leaving the block rather than when entering it? I'm doing that (i.e. using `tail`) and that works fine. However if I use `head` like you do, then this: `
line one
line two` ends up as a single line. — KajMagnus, Jul 22 '15 at 06:00
`new NodeTraversor(nodeVisitor).traverse(element);` no longer works on newer Jsoup versions (currently 1.14.3). Now all `traverse` methods in NodeTraversor are `static` so should be called like `NodeTraversor.traverse(nodeVisitor, element);`. — Pshemo, Jan 06 '22 at 14:05

score 5 · Answer 8 · edited May 23 '17 at 12:18

Based on the other answers and the comments on this question it seems that most people coming here are really looking for a general solution that will provide a nicely formatted plain text representation of an HTML document. I know I was.

Fortunately, jsoup already provides a fairly comprehensive example of how to achieve this: HtmlToPlainText.java

The example FormattingVisitor can easily be tweaked to your preference and deals with most block elements and line wrapping.

To avoid link rot, here is Jonathan Hedley's solution in full:

package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

import java.io.IOException;

/**
 * HTML to plain-text. This example program demonstrates the use of jsoup to convert HTML input to lightly-formatted
 * plain-text. That is divergent from the general goal of jsoup's .text() methods, which is to get clean data from a
 * scrape.
 * <p>
 * Note that this is a fairly simplistic formatter -- for real world use you'll want to embrace and extend.
 * </p>
 * <p>
 * To invoke from the command line, assuming you've downloaded the jsoup jar to your current directory:</p>
 * <p><code>java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]</code></p>
 * where <i>url</i> is the URL to fetch, and <i>selector</i> is an optional CSS selector.
 * 
 * @author Jonathan Hedley, jonathan@hedley.net
 */
public class HtmlToPlainText {
    private static final String userAgent = "Mozilla/5.0 (jsoup)";
    private static final int timeout = 5 * 1000;

    public static void main(String... args) throws IOException {
        Validate.isTrue(args.length == 1 || args.length == 2, "usage: java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]");
        final String url = args[0];
        final String selector = args.length == 2 ? args[1] : null;

        // fetch the specified URL and parse to a HTML DOM
        Document doc = Jsoup.connect(url).userAgent(userAgent).timeout(timeout).get();

        HtmlToPlainText formatter = new HtmlToPlainText();

        if (selector != null) {
            Elements elements = doc.select(selector); // get each element that matches the CSS selector
            for (Element element : elements) {
                String plainText = formatter.getPlainText(element); // format that element to plain text
                System.out.println(plainText);
            }
        } else { // format the whole doc
            String plainText = formatter.getPlainText(doc);
            System.out.println(plainText);
        }
    }

    /**
     * Format an Element to plain-text
     * @param element the root element to format
     * @return formatted text
     */
    public String getPlainText(Element element) {
        FormattingVisitor formatter = new FormattingVisitor();
        NodeTraversor traversor = new NodeTraversor(formatter);
        traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node

        return formatter.toString();
    }

    // the formatting rules, implemented in a breadth-first DOM traverse
    private class FormattingVisitor implements NodeVisitor {
        private static final int maxWidth = 80;
        private int width = 0;
        private StringBuilder accum = new StringBuilder(); // holds the accumulated text

        // hit when the node is first seen
        public void head(Node node, int depth) {
            String name = node.nodeName();
            if (node instanceof TextNode)
                append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
            else if (name.equals("li"))
                append("\n * ");
            else if (name.equals("dt"))
                append("  ");
            else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
                append("\n");
        }

        // hit when all of the node's children (if any) have been visited
        public void tail(Node node, int depth) {
            String name = node.nodeName();
            if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
                append("\n");
            else if (name.equals("a"))
                append(String.format(" <%s>", node.absUrl("href")));
        }

        // appends text to the string builder with a simple word wrap method
        private void append(String text) {
            if (text.startsWith("\n"))
                width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
            if (text.equals(" ") &&
                    (accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
                return; // don't accumulate long runs of empty spaces

            if (text.length() + width > maxWidth) { // won't fit, needs to wrap
                String words[] = text.split("\\s+");
                for (int i = 0; i < words.length; i++) {
                    String word = words[i];
                    boolean last = i == words.length - 1;
                    if (!last) // insert a space if not the last word
                        word = word + " ";
                    if (word.length() + width > maxWidth) { // wrap and reset counter
                        accum.append("\n").append(word);
                        width = word.length();
                    } else {
                        accum.append(word);
                        width += word.length();
                    }
                }
            } else { // fits as is, without need to wrap text
                accum.append(text);
                width += text.length();
            }
        }

        @Override
        public String toString() {
            return accum.toString();
        }
    }
}

One advantage this has over the simple `Element.wholeText` is that it extracts href links — Dario Seidl, Oct 24 '22 at 10:43

score 4 · Answer 9 · answered Jul 24 '14 at 04:53

text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = text.replaceAll("br2n", "\n");

works if the html itself doesn't contain "br2n"

So,

text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "<pre>\n</pre>")).text();

works more reliably and is simpler.

score 3 · Answer 10 · answered Sep 18 '13 at 17:02

Use textNodes() to get a list of the text nodes. Then concatenate them with \n as separator. Here's some scala code I use for this, java port should be easy:

val rawTxt = doc.body().getElementsByTag("div").first.textNodes()
                    .asScala.mkString("<br />\n")

abdolence · Answer 11 · 2019-07-23T09:41:54.860

This is my version of translating HTML to text (a modified version of user121196's answer, actually).

It doesn't just preserve line breaks: it also tidies the text, collapses excessive line breaks and unescapes HTML entities, which gives a much cleaner result (in my case the HTML comes from mail).

It's originally written in Scala, but you can port it to Java easily:

def html2text( rawHtml : String ) : String = {

    val htmlDoc = Jsoup.parseBodyFragment( rawHtml, "/" )
    htmlDoc.select("br").append("\\nl")
    htmlDoc.select("div").prepend("\\nl").append("\\nl")
    htmlDoc.select("p").prepend("\\nl\\nl").append("\\nl\\nl")

    org.jsoup.parser.Parser.unescapeEntities(
        Jsoup.clean(
          htmlDoc.html(),
          "",
          Whitelist.none(),
          new org.jsoup.nodes.Document.OutputSettings().prettyPrint(true)
        ),false
    ).
    replaceAll("\\\\nl", "\n").
    replaceAll("\r","").
    replaceAll("\n\\s+\n","\n").
    replaceAll("\n\n+","\n\n").     
    trim()      
}

You need to prepend a new line to
tags as well. Otherwise, if a div follows or tags, it will not be on a new line. — Andrei Volgin, Jul 22 '19 at 17:20
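
For readers who want the Java side, here is a rough port of only the last four steps of the chain above (carriage-return removal, blank-line collapsing, trim). It uses the JDK alone, so the jsoup calls are left out, and the class and method names are made up for illustration.

```java
// Illustrative only: a Java rendering of the final clean-up steps; names are hypothetical.
final class NewlineCleanup {

    static String normalize(String text) {
        return text
                .replaceAll("\r", "")          // drop carriage returns
                .replaceAll("\n\\s+\n", "\n")  // newline, whitespace run, newline -> one newline
                .replaceAll("\n\n+", "\n\n")   // cap any remaining run of newlines at two
                .trim();                       // strip leading and trailing whitespace
    }

    public static void main(String[] args) {
        System.out.println(normalize("a\r\n\r\nb"));     // "a", empty line, "b"
        System.out.println(normalize("a\n \t\nb"));      // "a" then "b" on the next line
        System.out.println(normalize("  a\n\n\n\nb "));  // "a" then "b" on the next line
    }
}
```

Note that `\s` also matches newline characters, so the second step already reduces three or more consecutive newlines to a single one; only a run of exactly two newlines survives as a blank line. Whether that is the spacing you want depends on the input, so it may be worth adjusting the patterns.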

score 3 · Answer 12 · answered Sep 08 '17 at 19:38

Try this by using jsoup:

    doc.outputSettings(new OutputSettings().prettyPrint(false));

    //select all <br> tags and append \n after that
    doc.select("br").after("\\n");

    //select all <p> tags and prepend \n before that
    doc.select("p").before("\\n");

    //get the HTML from the document, and retaining original new lines
    String str = doc.html().replaceAll("\\\\n", "\n");

score 3 · Answer 13 · answered Apr 12 '11 at 20:08

Try this:

public String noTags(String str){
    Document d = Jsoup.parse(str);
    TextNode tn = new TextNode(d.body().html(), "");
    return tn.getWholeText();
}

answered Apr 12 '11 at 20:08

manji

hello world

yo googlez
but i need hello world yo googlez (without html tags) – Billy Apr 13 '11 at 05:12
This answer doesn't return plain text; it returns HTML with newlines inserted. – KajMagnus Jul 22 '15 at 03:43

Chris6647 · Answer 14 · 2014-01-27T21:16:34.840

/**
 * Recursive method to replace html br with java \n. The recursive method ensures that the linebreaker can never end up pre-existing in the text being replaced.
 * @param html
 * @param linebreakerString
 * @return the html as String with proper java newlines instead of br
 */
public static String replaceBrWithNewLine(String html, String linebreakerString){
    String result = "";
    if(html.contains(linebreakerString)){
        result = replaceBrWithNewLine(html, linebreakerString+"1");
    } else {
        result = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", linebreakerString)).text(); // replace any HTML line breaks with the placeholder
        result = result.replace(linebreakerString, "\n"); // literal replacement, so the placeholder is not interpreted as a regex
    }
    return result;
}

Used by calling with the html in question, containing the br, along with whatever string you wish to use as the temporary newline placeholder. For example:

replaceBrWithNewLine(element.html(), "br2n")

The recursion ensures that the string you use as the newline placeholder is never actually present in the source HTML, because it keeps appending a "1" until the placeholder string is not found in the HTML. It won't have the formatting issue that the Jsoup.clean methods seem to encounter with special characters.

Good one, but you don't need recursion, just add this line: while(dirtyHTML.contains(linebreakerString)) linebreakerString = linebreakerString + "1"; — Dr NotSoKind, Jan 27 '14 at 15:03
Ah, yes. Completely true. Guess my mind got caught up in for once actually being able to use recursion :) — Chris6647, Jan 27 '14 at 20:17
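
The loop suggested in the comment can be sketched on its own, without jsoup, as follows. The class and method names are hypothetical; the second helper is an addition of mine that uses `String.replace` so the placeholder is handled as plain text and not as a regex.

```java
// Illustrative only: class and method names are hypothetical.
final class PlaceholderPicker {

    // Extend the seed with "1" until it no longer occurs in the HTML.
    // The loop terminates because the candidate eventually outgrows the input.
    static String uniquePlaceholder(String html, String seed) {
        String placeholder = seed;
        while (html.contains(placeholder)) {
            placeholder = placeholder + "1";
        }
        return placeholder;
    }

    // Swap the placeholder for real newlines. String.replace does literal
    // (non-regex) replacement, so metacharacters in the placeholder are harmless.
    static String restoreNewlines(String text, String placeholder) {
        return text.replace(placeholder, "\n");
    }

    public static void main(String[] args) {
        System.out.println(uniquePlaceholder("plain text", "br2n"));     // br2n
        System.out.println(uniquePlaceholder("a br2n b br2n1", "br2n")); // br2n11
    }
}
```

The placeholder returned here would then be passed to the regex substitution and the final replacement in place of the fixed `"br2n"` string.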

score 1 · Answer 15 · answered May 31 '16 at 18:14

Based on user121196's and Green Beret's answer with the selects and <pre>s, the only solution which works for me is:

org.jsoup.nodes.Element elementWithHtml = ....
elementWithHtml.select("br").append("<pre>\n</pre>");
elementWithHtml.select("p").prepend("<pre>\n\n</pre>");
elementWithHtml.text();
