How to convert HTML text to plain text?

Question

Friends, I have to parse a description from a URL, and the parsed content contains a few HTML tags. How can I convert it to plain text?

What are your precise requirements? Do you need to strip HTML tags? Extract the content of a specific tag? — Vivien Barousse, Aug 31 '10 at 10:05
I am able to extract the content, but the content looks like
zcc dsdfsf ddfdfsf
sfdfdfdfdf
I'm getting my data like the above, but I need it to be simple plain text, without those HTML tags. — MGSenthil, Aug 31 '10 at 10:54
Similar question with good answer here : http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726. I used Jericho and it works fine. — рüффп, Sep 03 '13 at 09:49
Duplicate of http://stackoverflow.com/q/240546/873282, http://stackoverflow.com/q/1699313/873282, http://stackoverflow.com/q/1518675/873282, and http://stackoverflow.com/q/832620/873282 — koppor, Dec 11 '16 at 21:45

score 40 · Answer 1 · answered Mar 15 '19 at 09:01

Yes, Jsoup is the better option. The following converts the whole HTML text to plain text:

String plainText = Jsoup.parse(your_html_text).text();

Ranjit

To keep the line breaks you can now also use `Jsoup.parse(html).wholeText()` – AvahW Jun 13 '19 at 22:12
This is not working for me, can you please check https://stackoverflow.com/questions/73861739/how-to-parse-text-with-html-tags-into-plain-text – vikramvi Sep 27 '22 at 02:24

score 27 · Answer 2 · answered Aug 31 '10 at 10:58

Just getting rid of HTML tags is simple:

// Replace each run of one or more HTML tags (with optional whitespace
// in between) with a single space character.
String strippedText = htmlText.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ");

But unfortunately the requirements are never that simple:

Usually, <p> and <div> elements need separate handling, there may be CDATA blocks with > characters (e.g. JavaScript) that break the regex, and so on.
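To make those caveats concrete, here is a minimal sketch of a slightly more careful regex pass. It is not a replacement for a real parser: the class name, the method name and the list of "line-breaking" tags are my own assumptions, entities such as &amp; are left undecoded, and malformed markup can still trip it up.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class HtmlToTextSketch {
    // Drop <script> and <style> elements together with their contents,
    // so a '>' inside JavaScript cannot confuse the tag pattern below.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Tags that usually start a new line when rendered (assumed list).
    private static final Pattern LINE_BREAKING_TAG =
            Pattern.compile("(?i)</?(p|div|br|li|tr|h[1-6])\\b[^>]*>");
    // Any remaining tag.
    private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]*>");

    static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        text = LINE_BREAKING_TAG.matcher(text).replaceAll("\n");
        text = ANY_TAG.matcher(text).replaceAll("");
        text = text.replaceAll("[ \\t]+", " ");   // collapse spaces and tabs
        text = text.replaceAll(" ?\\n ?", "\n");  // drop spaces around line breaks
        text = text.replaceAll("\\n{2,}", "\n");  // collapse blank lines
        return text.trim();
    }

    public static void main(String[] args) {
        String html = "<div><p>Hello <b>world</b></p>"
                + "<script>if (a > b) {}</script><p>Bye</p></div>";
        // Prints "Hello world" and "Bye" on separate lines.
        System.out.println(toPlainText(html));
    }
}
```

Removing script and style blocks first is the important ordering choice here; if the generic tag pattern ran first, the script body would be left behind as text.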

Sean Patrick Floyd

For some background on why this will not work for the general case, and won't be f(u|oo)l-proof: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Erwin Bolwidt Apr 12 '17 at 13:00
Love it... so simple, yet so powerful – George Apr 04 '21 at 20:11

score 10 · Answer 3 · edited Apr 23 '12 at 03:57

You can use this single line to remove the HTML tags and get plain text.

htmlString=htmlString.replaceAll("\\<.*?\\>", "");
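A quick sketch of what this one-liner does and does not handle, in case it helps. The class and method names are hypothetical. Two things to watch for: character entities are left as they are, and because `.` does not match line breaks by default, a tag that spans several lines is not removed unless the `(?s)` flag is added.

```java
// Hypothetical demo class, for illustration only.
final class TagStripDemo {
    // The one-liner from this answer, unchanged.
    static String strip(String html) {
        return html.replaceAll("\\<.*?\\>", "");
    }

    // Same idea with DOTALL enabled, so '.' also matches line breaks.
    static String stripDotAll(String html) {
        return html.replaceAll("(?s)<.*?>", "");
    }

    public static void main(String[] args) {
        // Tags are removed, but the entity is not decoded.
        System.out.println(strip("<p>Tom &amp; Jerry</p>")); // Tom &amp; Jerry

        String multiLine = "<a\n href=\"x\">link</a>";
        // The opening tag spans two lines, so only </a> is removed here.
        System.out.println(strip(multiLine));
        // With (?s) the whole opening tag is matched as well.
        System.out.println(stripDotAll(multiLine)); // link
    }
}
```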

edited Apr 23 '12 at 03:57

demongolem

answered Sep 03 '10 at 10:16

Kandha

score 7 · Answer 4 · answered Jan 05 '21 at 05:45

Use Jsoup.

Add the dependency

<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.13.1</version>
</dependency>

Now in your Java code:

public static String html2text(String html) {
    return Jsoup.parse(html).wholeText();
}

Call html2text with the HTML text and it returns the plain text.

score 5 · Answer 5 · edited May 23 '17 at 12:18

Use an HTML parser like HtmlCleaner.

For a detailed answer, see: How to remove HTML tag in Java
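If adding a library is not an option, the same parser-based idea can be sketched with the HTML parser that ships with the JDK (in the Swing packages). This is not HtmlCleaner and it is an old, lenient parser, so treat it as an illustration of the approach only. The class and method names are hypothetical, and joining text runs with a single space is my own simplification.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, for illustration only.
final class SwingHtmlText {
    static String extractText(String html) {
        StringBuilder out = new StringBuilder();
        // The parser calls handleText once for every run of text it finds.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (out.length() > 0) {
                    out.append(' '); // separate text runs with one space
                }
                out.append(data);
            }
        };
        try {
            // The last argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
                extractText("<html><body><p>Hello</p><p>world</p></body></html>"));
    }
}
```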

edited May 23 '17 at 12:18

Community

answered Aug 31 '10 at 10:06

ankitjaininfo

score 1 · Answer 6 · answered Aug 31 '10 at 10:07

I'd recommend running the raw HTML through JTidy, which should give you output that you can write XPath expressions against. This is the most robust way I've found of scraping HTML.
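As a rough sketch of the XPath half of that workflow, the snippet below uses only the JDK's XML APIs. It assumes the markup has already been cleaned into well-formed XML (the JTidy step is not shown), with no DOCTYPE and no namespace declaration. The class name, method name and the `//p` expression are hypothetical choices for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class XhtmlXPathSketch {
    // Returns the text of every <p> element, one paragraph per line.
    static String paragraphText(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList paragraphs =
                    (NodeList) xpath.evaluate("//p", doc, XPathConstants.NODESET);
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                if (i > 0) {
                    out.append('\n');
                }
                // getTextContent() concatenates the text of all descendants.
                out.append(paragraphs.item(i).getTextContent().trim());
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello <b>world</b></p><p>Bye</p></body></html>";
        System.out.println(paragraphText(xhtml));
    }
}
```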

Jon Freedman

score 1 · Answer 7 · edited Nov 14 '16 at 14:41

If you want the text laid out the way a browser would display it, use Jericho's renderer:

import net.htmlparser.jericho.*;
import java.util.*;
import java.io.*;
import java.net.*;

public class RenderToText {
    public static void main(String[] args) throws Exception {
        String sourceUrlString = "data/test.html";
        if (args.length == 0) {
            System.err.println("Using default argument of \"" + sourceUrlString + '"');
        } else {
            sourceUrlString = args[0];
        }
        // Treat an argument without a scheme as a local file path.
        if (sourceUrlString.indexOf(':') == -1) sourceUrlString = "file:" + sourceUrlString;
        Source source = new Source(new URL(sourceUrlString));
        String renderedText = source.getRenderer().toString();
        System.out.println("\nSimple rendering of the HTML document:\n");
        System.out.println(renderedText);
    }
}

I hope this also helps to render tables in a browser-like format.

Thanks, Ganesh

Can the downvoters please explain why they downvote? – koppor Dec 11 '16 at 21:40

score 0 · Answer 8 · edited Oct 04 '18 at 01:35

I needed a plain text representation of some HTML that included FreeMarker tags. The problem was handed to me with a Jsoup solution, but Jsoup was escaping the FreeMarker tags, thus breaking the functionality. I also tried HtmlCleaner (SourceForge), but that left the HTML header and style content in place (with only the tags removed). I ended up using Jericho, as suggested here: http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726

My code:

return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();

Setting the maximum line length to Integer.MAX_VALUE ensures lines are not artificially wrapped at the renderer's default width. The setNewLine(null) call makes the output use the same new line character(s) as the source.

score 0 · Answer 9 · answered May 20 '20 at 10:04

I use HTMLUtil.textFromHTML(value) from

<dependency>
    <groupId>org.clapper</groupId>
    <artifactId>javautil</artifactId>
    <version>3.2.0</version>
</dependency>

Ruslanas

score 0 · Answer 10 · answered Jan 12 '21 at 21:25

Using Jsoup, I got all the text on a single line.

So I used the following block of code to strip the tags and keep line breaks:

private String parseHTMLContent(String html) {
    // Replace every tag with a newline, then collapse runs of newlines into one.
    String result = html.replaceAll("\\<.*?\\>", "\n");
    return result.replaceAll("\n+", "\n");
}

Not the best solution, but it solved my problem :)
