Is there a regular expression that can completely remove an HTML tag? By the way, I'm using Java.
-
Typing your title into the Search box, I got the following: http://stackoverflow.com/search?q=How+to+remove+HTML+tag+in+Java ... did you not get the same while you were posting the question? – kdgregory Nov 09 '09 at 12:37
-
I found no duplicates. These questions are about extracting text from HTML: http://stackoverflow.com/questions/240546/removing-html-from-a-java-string http://stackoverflow.com/questions/832620/stripping-html-tags-in-java – tangens Nov 10 '09 at 17:24
6 Answers
There is JSoup, which is a Java library made for HTML manipulation. Look at the clean() method and the Whitelist object (renamed Safelist in newer jsoup releases). Easy-to-use solution!
-
WOW, you, sir, really made my day. I like that, YES! Markdownj, Markdown4J, htmlCleaner... all of them are ***** (sorry). JSoup is the one and only where you really achieve that with a one-liner: String plain = new HtmlToPlainText().getPlainText(Jsoup.parse(html)); – jebbie Jul 17 '13 at 14:56
-
Shorter code would be `String plaintext = Jsoup.parse(html).text();` – jrarama Jul 09 '15 at 03:24
-
@jrarama - Not at all. `Jsoup.parse(html).text()` removes all of the tags and whitespace, leaving you with a single long line of text only, while `new HtmlToPlainText().getPlainText(Jsoup.parse(html))` formats the text in a simplistic way, keeping line breaks, paragraphs, bullet points, etc. – isapir Feb 01 '17 at 01:14
-
@isapir: HtmlToPlainText is not included in https://mvnrepository.com/artifact/org.jsoup/jsoup/1.11.3 – Marco Sulla May 21 '18 at 07:51
-
That's because HtmlToPlainText is an example, see https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java – ChrLipp Oct 03 '18 at 14:22
You should use an HTML parser instead. I like htmlCleaner, because it gives me a pretty-printed version of the HTML.
With htmlCleaner you can do:
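TagNode root = htmlCleaner.clean( stream );
// "@id" selects the id attribute; a bare "id" would look for a child element named id
Object[] found = root.evaluateXPath( "//div[@id='something']" );
// test the first match, not the array itself
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}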

-
Do we need any library in order to use the code above? And in root.evaluateXPath( "//div[id='something']" ), "something" could be any id, right? Please let me know. Thanks – Geet taunk Aug 26 '13 at 14:54
If you just need to remove tags then you can use this regular expression:
content = content.replaceAll("<[^>]+>", "");
It will remove only the tags, not other HTML constructs. For anything more complex you should use a parser.
EDIT: To avoid problems with HTML comments you can do the following (the (?s) flag lets . match line breaks, so comments spanning several lines are removed too):
content = content.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");
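For reuse, the two replacements can be wrapped in a small helper with precompiled patterns. This is only a sketch: the class name `TagStripper` and the method `stripTags` are made up for illustration, and it has the same limits as the one-liner (it only deletes things that look like comments or tags).

```java
import java.util.regex.Pattern;

final class TagStripper {
    // (?s) turns on DOTALL so '.' also matches line breaks inside a comment.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // [^>] already matches line breaks, so no flag is needed here.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Removes comments first, then anything that looks like a tag. */
    static String stripTags(String html) {
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        return TAG.matcher(withoutComments).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <!-- a > b -->\n<b>world</b></p>";
        System.out.println(stripTags(html));
    }
}
```

Removing comments first matters: otherwise the `>` inside the comment would end the tag match early and leave ` b -->` in the output.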

-
Since you do not use any of the meta characters `.`, `^` and `$`, the `s` and `m` flags can be omitted. – Bart Kiers Nov 09 '09 at 09:50
-
This regex is liable to cause mangling if the HTML contains XML comments with embedded '<' or '>' characters. – Stephen C Nov 09 '09 at 12:24
No. Regular expressions cannot, by definition, parse HTML.
You could use a regex to s/<[^>]*\>// or something naive like that, but it's going to be insufficient, especially if you're interested in removing the contents of tags.
As another poster said, use an actual HTML parser.
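To make "insufficient" concrete, here is a small sketch of two inputs where the naive pattern goes wrong. The class and method names (`NaiveRegexLimits`, `naiveStrip`) are hypothetical, and the pattern is the Java equivalent of the substitution above.

```java
final class NaiveRegexLimits {
    /** The naive approach: drop everything between '<' and the next '>'. */
    static String naiveStrip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A '>' inside an attribute value ends the match too early,
        // so part of the attribute leaks into the output.
        System.out.println(naiveStrip("<a title=\"a > b\">link</a>"));

        // The tags are removed, but the script body stays behind as text.
        System.out.println(naiveStrip("<script>alert(1)</script>text"));
    }
}
```

A parser handles both cases because it knows where an attribute value ends and which element a piece of text belongs to.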

You don't need an HTML parser for this. The code below removes all HTML comments:
htmlString = htmlString.replaceAll("(?s)<!--.*?-->", "");
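The `(?s)` prefix is what makes this work for comments that span several lines. A quick sketch of the difference, with made-up class and method names:

```java
final class CommentRemoval {
    /** With DOTALL: '.' matches line breaks, so multi-line comments are removed. */
    static String withDotAll(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "");
    }

    /** Without DOTALL: '.' stops at a line break, so a multi-line comment never matches. */
    static String withoutDotAll(String html) {
        return html.replaceAll("<!--.*?-->", "");
    }

    public static void main(String[] args) {
        String html = "a<!-- line one\nline two -->b";
        System.out.println(withDotAll(html));     // comment removed
        System.out.println(withoutDotAll(html));  // input comes back unchanged
    }
}
```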

You can use this simple code to remove all HTML tags:
htmlString = htmlString.replaceAll("\\<.*?\\>", "");
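One caveat with the lazy `.*?` form: without the `(?s)` flag, `.` does not match line breaks, so a tag whose attributes wrap onto a second line is left in place. The negated class `[^>]+` used in another answer does not have that problem. A sketch comparing the two, with hypothetical names:

```java
final class LazyVsNegatedClass {
    /** Lazy dot: cannot cross a line break unless DOTALL is enabled. */
    static String stripLazy(String html) {
        return html.replaceAll("\\<.*?\\>", "");
    }

    /** Negated class: matches any character except '>', line breaks included. */
    static String stripNegated(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String html = "<a\nhref=\"x\">link</a>";
        System.out.println(stripLazy(html));    // opening tag survives
        System.out.println(stripNegated(html)); // both tags removed
    }
}
```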

-
I would never do a job like that on my own. Parsing HTML into plain text is a really tough job, dude. – jebbie Jul 17 '13 at 14:57
-
It worked for me, but it probably depends on the complexity of the tags, comments, scripts, etc. So for a complex case an HTML library would probably be better. – jmoran Dec 21 '17 at 02:21