15

Is there a regular expression that can completely remove an HTML tag? By the way, I'm using Java.

tangens
freddiefujiwara
  • 2
    Typing your title into the Search box, I got the following: http://stackoverflow.com/search?q=How+to+remove+HTML+tag+in+Java ... did you not get the same while you were posting the question? – kdgregory Nov 09 '09 at 12:37
  • 2
    I found no duplicates. These questions care about extracting text from HTML: http://stackoverflow.com/questions/240546/removing-html-from-a-java-string http://stackoverflow.com/questions/832620/stripping-html-tags-in-java – tangens Nov 10 '09 at 17:24

6 Answers

24

There is jsoup, a Java library made for HTML manipulation. Look at the Jsoup.clean() method and the Whitelist class (named Safelist in newer jsoup releases). It is an easy-to-use solution!

Alex
Simon
  • 2
    WOW, you sir, really made my day, i like that, YES! Markdownj, Markdown4J, htmlCleaner.. all of them is ***** sorry.. JSoup is the one and only where you really achieve that with a one-liner: String plain = new HtmlToPlainText().getPlainText(Jsoup.parse(html)); – jebbie Jul 17 '13 at 14:56
  • 4
    A shorter code would be `String plaintext = Jsoup.parse(html).text();` – jrarama Jul 09 '15 at 03:24
  • 3
    @jrarama - Not at all. `Jsoup.parse(html).text()` removes all of the tags and whitespace, leaving you with a long single line of text only, while `new HtmlToPlainText().getPlainText(Jsoup.parse(html))` formats the text in a simplistic way, keeping line breaks, paragraphs, bullet points, etc. – isapir Feb 01 '17 at 01:14
  • @isapir: HtmlToPlainText is not included in https://mvnrepository.com/artifact/org.jsoup/jsoup/1.11.3 – Marco Sulla May 21 '18 at 07:51
  • That's because HtmlToPlainText is an example, see https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java – ChrLipp Oct 03 '18 at 14:22
20

You should use an HTML parser instead. I like htmlCleaner because it gives me a pretty-printed version of the HTML.

With htmlCleaner you can do:

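// Requires the HtmlCleaner library (org.htmlcleaner) on the classpath.
HtmlCleaner htmlCleaner = new HtmlCleaner();
TagNode root = htmlCleaner.clean( stream );
// Select the element by its id attribute (note the '@').
Object[] found = root.evaluateXPath( "//div[@id='something']" );
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}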
tangens
  • Thanks for pointing me to htmlCleaner :) – exhuma Nov 09 '09 at 12:16
  • Do we need any library in order to use the code above? And in root.evaluateXPath( "//div[id='something']" ); the "something" could be any id, right? Please let me know. Thanks – Geet taunk Aug 26 '13 at 14:54
6

If you just need to remove tags then you can use this regular expression:

content = content.replaceAll("<[^>]+>", "");

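This removes only the tags themselves, not other HTML constructs: entities such as &amp; and the text inside script and style elements are left in place. For anything more complex you should use a parser.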

EDIT: To avoid problems with HTML comments you can do the following:

content = content.replaceAll("(?s)<!--.*?-->", "").replaceAll("<[^>]+>", "");
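
As a rough illustration of the two-step replacement above, here is a minimal sketch that precompiles both patterns and removes comments before tags. The class and method names (TagStripper, stripTags) are made up for this example, and the (?s) flag is included on the assumption that comments may span several lines. It shares every limitation of regex-based stripping.

```java
import java.util.regex.Pattern;

final class TagStripper {
    // (?s) lets '.' match line terminators, so multi-line comments are matched.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Anything from '<' up to the next '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Hypothetical helper: removes comments first, then the remaining tags.
    static String stripTags(String html) {
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        return TAG.matcher(withoutComments).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // Hello world
        System.out.println(stripTags("a<!-- x > y\n-->b"));         // ab
    }
}
```

The order matters: if tags were removed first, a comment such as `<!-- x > y -->` would be cut at its first '>' and the rest of the comment would stay in the output.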
Andrey Adamovich
  • Since you do not use any of the meta characters `.`, `^` and `$`, the `s`- and `m` flags can be omitted. – Bart Kiers Nov 09 '09 at 09:50
  • This regex is liable to cause mangling if the HTML contains XML comments with embedded '<' or '>' characters. – Stephen C Nov 09 '09 at 12:24
4

No. Regular expressions cannot, by definition, parse HTML.

You could use a naive substitution such as `s/<[^>]*>//g`, but it is going to be insufficient, especially if you also want to remove the contents of tags.

As another poster said, use an actual HTML parser.
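
To make the "insufficient" part concrete, here is a small sketch of two inputs where the naive substitution goes wrong. The class and method names (NaiveStripDemo, naiveStrip) are hypothetical, and the inputs are invented for illustration.

```java
final class NaiveStripDemo {
    // Java equivalent of the naive s/<[^>]*>//g substitution.
    static String naiveStrip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A '>' inside a quoted attribute value ends the match too early,
        // so part of the tag leaks into the output:  b">link
        System.out.println(naiveStrip("<a title=\"a > b\">link</a>"));

        // Only the tags are removed; the script source stays as text:
        // alert(1)done
        System.out.println(naiveStrip("<script>alert(1)</script>done"));
    }
}
```

A parser avoids both problems because it knows where attribute values and element contents begin and end.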

George G
Moishe Lettvin
2

You don't need an HTML parser. The code below removes all HTML comments:

htmlString = htmlString.replaceAll("(?s)<!--.*?-->", "");
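
The two details that make this pattern work are the (?s) flag and the reluctant quantifier `.*?`. The sketch below, with hypothetical names (CommentStripDemo and its methods), shows what changes when either one is dropped.

```java
final class CommentStripDemo {
    // The pattern from the answer: DOTALL plus a reluctant quantifier.
    static String stripComments(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "");
    }

    // Greedy variant: matches from the first "<!--" to the last "-->".
    static String stripCommentsGreedy(String html) {
        return html.replaceAll("(?s)<!--.*-->", "");
    }

    // Variant without (?s): '.' stops at line terminators.
    static String stripCommentsSingleLine(String html) {
        return html.replaceAll("<!--.*?-->", "");
    }

    public static void main(String[] args) {
        String html = "a<!-- one\ntwo -->b<!-- three -->c";
        System.out.println(stripComments(html));           // abc
        System.out.println(stripCommentsGreedy(html));     // ac (the "b" is lost)
        System.out.println(stripCommentsSingleLine(html)); // first comment survives
    }
}
```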

Saeid
0

You can use this simple code to remove all HTML tags:

htmlString = htmlString.replaceAll("<.*?>", "");
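
As a quick check of what this pattern catches, here is a sketch with hypothetical names (LazyTagDemo, lazyStrip, classStrip). It suggests that closing tags are matched as well, while a tag that spans a line break is missed because '.' does not match line terminators by default.

```java
final class LazyTagDemo {
    // The pattern from the answer, without the unnecessary escapes.
    static String lazyStrip(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // Character-class variant; [^>] also matches line breaks.
    static String classStrip(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        // Opening and closing tags are both removed: bold
        System.out.println(lazyStrip("<b>bold</b>"));

        // The opening tag contains a line break, so only </a> is removed.
        String multiLine = "<a\nhref=\"x\">link</a>";
        System.out.println(lazyStrip(multiLine));
        // The character-class variant removes both tags: link
        System.out.println(classStrip(multiLine));
    }
}
```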
Kandha
  • 1
    This will only remove opening tags and leave closing tags unhandled. – jlordo Jan 04 '13 at 23:52
  • I would never do a job like that on my own - parsing HTML into plain text is a really tough job, dude.. – jebbie Jul 17 '13 at 14:57
  • It worked for me, but it may depend on the complexity of the tags, comments, scripts, etc. So for a complex case an HTML library would probably be better. – jmoran Dec 21 '17 at 02:21