Friends, I have to parse a description from a URL, and the parsed content contains a few HTML tags. How can I convert it to plain text?
10 Answers
Yes, Jsoup is the better option. To convert the whole HTML text to plain text, do this:
    String plainText = Jsoup.parse(your_html_text).text();

To keep the line breaks you can now also use `Jsoup.parse(html).wholeText()` – AvahW Jun 13 '19 at 22:12
This is not working for me, can you please check https://stackoverflow.com/questions/73861739/how-to-parse-text-with-html-tags-into-plain-text – vikramvi Sep 27 '22 at 02:24
Just getting rid of HTML tags is simple:
// Replace each run of one or more HTML tags (with optional
// whitespace in between) with a single space character.
String strippedText = htmlText.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ");
But unfortunately the requirements are never that simple: usually, <p> and <div> elements need separate handling, and there may be CDATA blocks containing > characters (e.g. JavaScript) that mess up the regex, etc.
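As a rough illustration of the "separate handling" mentioned above, here is a minimal regex-only sketch. The class name `RegexHtmlStripper`, the method `strip`, and the list of block-level tags are assumptions made for this example, not part of any library, and the approach is still not a substitute for a real HTML parser.

```java
import java.util.regex.Pattern;

final class RegexHtmlStripper {
    // Whole <script>/<style> elements, including bodies that may contain '>' characters.
    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Opening or closing block-level tags (hypothetical, incomplete list).
    private static final Pattern BLOCK_TAGS =
            Pattern.compile("(?i)<\\s*/?\\s*(p|div|br|li|tr|h[1-6])\\b[^>]*>");
    // Any remaining tag.
    private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]*>");

    static String strip(String html) {
        String s = SCRIPT_STYLE.matcher(html).replaceAll(""); // drop scripts and styles entirely
        s = BLOCK_TAGS.matcher(s).replaceAll("\n");           // block boundaries become line breaks
        s = ANY_TAG.matcher(s).replaceAll("");                // inline tags simply disappear
        s = s.replaceAll("[ \\t]+", " ");                     // collapse runs of spaces and tabs
        s = s.replaceAll(" ?\\n[ \\n]*", "\n");               // collapse blank lines
        return s.trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<div><p>Hello <b>world</b></p><p>Second   line</p></div>"));
    }
}
```

This sketch does not decode entities and does not cope with comments or malformed markup, which is why a parser such as Jsoup is usually the safer choice.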

For some background on why this will not work for the general case, and won't be f(u|oo)l-proof: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Erwin Bolwidt Apr 12 '17 at 13:00
You can use this single line to remove the HTML tags and leave plain text:
    htmlString = htmlString.replaceAll("\\<.*?\\>", "");
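One thing the single-line replacement above does not do is decode character entities, so text such as `&amp;` survives in the output. Also, because `.` does not match line terminators by default, a tag that spans several lines is left in place. The sketch below adds a decoding step for a handful of common entities; the class name `EntityDecoder` and both method names are made up for this example, and the entity list is deliberately incomplete.

```java
final class EntityDecoder {
    // Same regex as in the answer above: removes anything that looks like a tag.
    static String stripTags(String html) {
        return html.replaceAll("\\<.*?\\>", "");
    }

    // Decodes only a few common entities (an assumption for this sketch, not a full table).
    static String decodeBasicEntities(String text) {
        return text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&#39;", "'")
                   .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;" rather than "<"
    }

    public static void main(String[] args) {
        String html = "<p>Tom &amp; Jerry &lt;3</p>";
        System.out.println(decodeBasicEntities(stripTags(html)));
    }
}
```

Decoding has to happen after the tags are stripped; otherwise a decoded `&lt;` could be mistaken for the start of a tag.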

Use Jsoup.
Add the dependency:
    <dependency>
        <!-- jsoup HTML parser library @ https://jsoup.org/ -->
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.13.1</version>
    </dependency>
Now in your Java code:
    import org.jsoup.Jsoup;

    // Parses the HTML and returns its text, keeping the original line breaks.
    public static String html2text(String html) {
        return Jsoup.parse(html).wholeText();
    }
Just call the method html2text, passing it the HTML text, and it will return plain text.

Use an HTML parser like htmlCleaner.
For a detailed answer, see: How to remove HTML tag in Java
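To show what the parser-based approach looks like without adding a dependency, here is a minimal sketch that uses the HTML parser bundled with the JDK (`javax.swing.text.html`) rather than htmlCleaner. The class name `SwingHtmlToText` and the method `toText` are invented for this example, and the whitespace normalization at the end is a simplification.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class SwingHtmlToText {
    static String toText(String html) {
        StringBuilder out = new StringBuilder();
        // The parser calls handleText for each run of character data it finds between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data).append(' ');
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse whitespace so the result does not depend on how the text was chunked.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(toText("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Because a space is appended after every text chunk, a word split by an inline tag (for example `wor<b>ld</b>`) would come out with a space in the middle; a dedicated library handles such cases better.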

I'd recommend running the raw HTML through jTidy, which should give you output that you can write XPath expressions against. This is the most robust way I've found of scraping HTML.

If you want the text laid out the way a browser would display it, use:
    import net.htmlparser.jericho.*;
    import java.util.*;
    import java.io.*;
    import java.net.*;

    public class RenderToText {
        public static void main(String[] args) throws Exception {
            String sourceUrlString="data/test.html";
            if (args.length==0)
                System.err.println("Using default argument of \""+sourceUrlString+'"');
            else
                sourceUrlString=args[0];
            if (sourceUrlString.indexOf(':')==-1) sourceUrlString="file:"+sourceUrlString;
            Source source=new Source(new URL(sourceUrlString));
            String renderedText=source.getRenderer().toString();
            System.out.println("\nSimple rendering of the HTML document:\n");
            System.out.println(renderedText);
        }
    }
I hope this also helps with rendering tables the way a browser would show them.
Thanks, Ganesh

I needed a plain text representation of some HTML which included FreeMarker tags. The problem was handed to me with a JSoup solution, but JSoup was escaping the FreeMarker tags, thus breaking the functionality. I also tried htmlCleaner (sourceforge), but that left the HTML header and style content (tags removed). http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726
My code:
return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();
The `maxLineLength` ensures lines are not artificially wrapped at 80 characters.
The `setNewLine(null)` uses the same newline character(s) as the source.

I use `HTMLUtil.textFromHTML(value)` from
    <dependency>
        <groupId>org.clapper</groupId>
        <artifactId>javautil</artifactId>
        <version>3.2.0</version>
    </dependency>

Using Jsoup, I got all the text on a single line, so I used the following code to parse the HTML and keep new lines:
    private String parseHTMLContent(String html) {
        // Replace every tag with a newline so text separated by tags stays on separate lines.
        String result = html.replaceAll("\\<.*?\\>", "\n");
        // Collapse each run of consecutive newlines into a single newline.
        return result.replaceAll("\n{2,}", "\n");
    }
Not the best solution but solved my problem :)

zcc dsdfsf ddfdfsf
sfdfdfdfdf. My data comes back like the above, but I need it as simple plain text, without those HTML tags. – MGSenthil Aug 31 '10 at 10:54