How to remove HTML tag in Java

Question

Is there a regular expression that can completely remove an HTML tag? By the way, I'm using Java.

Typing your title into the Search box, I got the following: http://stackoverflow.com/search?q=How+to+remove+HTML+tag+in+Java ... did you not get the same while you were posting the question? — kdgregory, Nov 09 '09 at 12:37
I found no duplicates. These questions care about extracting text from HTML: http://stackoverflow.com/questions/240546/removing-html-from-a-java-string http://stackoverflow.com/questions/832620/stripping-html-tags-in-java — tangens, Nov 10 '09 at 17:24

score 24 · Answer 1 · edited Jan 27 '12 at 17:26

There is jsoup, a Java library made for HTML manipulation. Look at the clean() method and the Whitelist class (renamed Safelist in later jsoup releases). It is an easy-to-use solution!

edited Jan 27 '12 at 17:26 by Alex
answered Jan 27 '12 at 16:40 by Simon

WOW, you sir, really made my day. I like that, YES! Markdownj, Markdown4J, htmlCleaner.. all of them are ***** sorry.. JSoup is the one and only where you really achieve that with a one-liner: `String plain = new HtmlToPlainText().getPlainText(Jsoup.parse(html));` – jebbie Jul 17 '13 at 14:56
Shorter code would be `String plaintext = Jsoup.parse(html).text();` – jrarama Jul 09 '15 at 03:24
@jrarama - Not at all. `Jsoup.parse(html).text()` removes all of the tags and collapses whitespace, leaving you with a long single line of text only, while `new HtmlToPlainText().getPlainText(Jsoup.parse(html))` formats the text in a simplistic way, keeping line breaks, paragraphs, bullet points, etc. – isapir Feb 01 '17 at 01:14
@isapir: HtmlToPlainText is not included in https://mvnrepository.com/artifact/org.jsoup/jsoup/1.11.3 – Marco Sulla May 21 '18 at 07:51
That's because HtmlToPlainText is an example, see https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java – ChrLipp Oct 03 '18 at 14:22

score 20 · Accepted Answer · answered Nov 09 '09 at 06:05

You should use an HTML parser instead. I like htmlCleaner, because it gives me a pretty-printed version of the HTML.

With htmlCleaner you can do:

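// Requires the HtmlCleaner library (org.htmlcleaner.HtmlCleaner, org.htmlcleaner.TagNode)
HtmlCleaner htmlCleaner = new HtmlCleaner();
TagNode root = htmlCleaner.clean( stream );
// Attribute tests in XPath need the @ prefix
Object[] found = root.evaluateXPath( "//div[@id='something']" );
// Check the first match, not the array itself
if( found.length > 0 && found[0] instanceof TagNode ) {
    ((TagNode)found[0]).removeFromTree();
}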

answered Nov 09 '09 at 06:05 by tangens

Thanks for pointing me to htmlCleaner :) – exhuma Nov 09 '09 at 12:16
Do we need any library in order to use the code above? And in the root.evaluateXPath(...) call, "something" could be any id, right? Please let me know. Thanks – Geet taunk Aug 26 '13 at 14:54

score 6 · Answer 3 · edited Nov 09 '09 at 12:40

If you just need to remove tags then you can use this regular expression:

content = content.replaceAll("<[^>]+>", "");

It removes only the tags themselves; everything else, such as character entities, is left untouched. For anything more complex you should use a parser.

EDIT: To avoid problems with HTML comments you can do the following:

content = content.replaceAll("<!--.*?-->", "").replaceAll("<[^>]+>", "");
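To see why the EDIT removes comments before tags, here is a minimal sketch using only `java.util.regex`. The class name `TagStripper` and the sample input are made up for illustration; the two patterns are the ones from this answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; only the two regular expressions come from the answer.
class TagStripper {
    // A comment, matched reluctantly so that each comment is removed separately.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->");
    // Anything between '<' and the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // First version of the answer: tags only.
    static String stripTagsOnly(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    // Edited version of the answer: comments first, then tags.
    static String stripCommentsThenTags(String html) {
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        return TAG.matcher(withoutComments).replaceAll("");
    }

    public static void main(String[] args) {
        // A comment that contains '>' trips up the tag-only pattern.
        String html = "a<!-- x > y -->b<i>c</i>";
        System.out.println(stripTagsOnly(html));         // prints "a y -->bc"
        System.out.println(stripCommentsThenTags(html)); // prints "abc"
    }
}
```

The tag pattern stops at the first `>` inside the comment, so part of the comment leaks into the output unless comments are removed first.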

edited Nov 09 '09 at 12:40
answered Nov 09 '09 at 07:29 by Andrey Adamovich

Since you do not use any of the meta characters `.`, `^` and `$`, the `s` and `m` flags can be omitted. – Bart Kiers Nov 09 '09 at 09:50
This regex is liable to cause mangling if the HTML contains XML comments with embedded '<' or '>' characters. – Stephen C Nov 09 '09 at 12:24

score 4 · Answer 4 · edited Jan 08 '16 at 09:57

No. Regular expressions cannot, by definition, parse HTML.

You could use a regex to s/<[^>]*\>// or something naive like that but it's going to be insufficient, especially if you're interested in removing the contents of tags.

As another poster said, use an actual HTML parser.
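As a rough illustration of the difference, the sketch below runs the naive pattern and a real parser over the same input. It uses the HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`) only because it needs no extra dependency; this is a swapped-in choice, not the library recommended in the other answers, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class comparing the naive regex with a parser-based approach.
class ParserVsRegex {

    // The naive approach: delete everything from '<' to the next '>'.
    static String regexStrip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Let the JDK's HTML parser tokenize the input and keep only the text it reports.
    static String parserStrip(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from a String; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // A quoted attribute value may legally contain '>'.
        String html = "<p title=\"a > b\">Hello</p>";
        System.out.println(regexStrip(html));  // prints ` b">Hello` (the tag is cut at the wrong '>')
        System.out.println(parserStrip(html)); // the parser reads the attribute as a whole and reports only the text
    }
}
```

The regex has no notion of quoted attribute values, so it ends the tag at the first `>` it sees. A parser tracks that state, which is the reason the answers here recommend one for anything beyond trivial input.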

edited Jan 08 '16 at 09:57 by George G
answered Nov 09 '09 at 06:13 by Moishe Lettvin

score 2 · Answer 5 · answered Jun 13 '12 at 06:09

You don't need any HTML parser. The code below removes all HTML comments:

htmlString = htmlString.replaceAll("(?s)<!--.*?-->", "");
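The `(?s)` flag matters once a comment spans more than one line. The sketch below assumes the comment pattern `<!--.*?-->` and compares it with and without the flag; the class and method names are invented for the example.

```java
// Hypothetical demo class; assumes the comment pattern <!--.*?-->.
class CommentRemover {

    // With (?s) (DOTALL), '.' also matches line terminators.
    static String removeComments(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "");
    }

    // Without the flag, '.' stops at a line break, so multi-line comments survive.
    static String removeSingleLineComments(String html) {
        return html.replaceAll("<!--.*?-->", "");
    }

    public static void main(String[] args) {
        String html = "<p>keep</p><!-- line one\nline two --><p>this</p>";
        System.out.println(removeComments(html));                        // prints "<p>keep</p><p>this</p>"
        System.out.println(removeSingleLineComments(html).equals(html)); // prints "true": nothing was removed
    }
}
```

This removes comments only; the tags themselves are left in place.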

answered Jun 13 '12 at 06:09 by Saeid

score 0 · Answer 6 · answered Sep 03 '10 at 10:13

You can use this simple code to remove all HTML tags:

htmlString = htmlString.replaceAll("\\<.*?\\>", "");

answered Sep 03 '10 at 10:13 by Kandha

This will only remove opening tags and leave closing tags unhandled. – jlordo Jan 04 '13 at 23:52
I would never do a job like that on my own; parsing HTML into plain text is a really tough job, dude. – jebbie Jul 17 '13 at 14:57
It worked for me, but maybe that depends on the complexity of the tags, comments, scripts, etc. So for a complex case an HTML library would probably be better. – jmoran Dec 21 '17 at 02:21
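To check what the pattern from this answer actually matches, here is a small sketch that compares it with a greedy variant and with the negated character class used in another answer. The class name and inputs are made up; `<` and `>` are not special in Java regular expressions, so the backslashes are omitted here.

```java
// Hypothetical demo class for comparing the three patterns.
class QuantifierDemo {

    // Pattern from this answer: reluctant '.*?' stops at the nearest '>'.
    static String stripReluctant(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // Greedy variant: '.*' runs to the last '>' on the line.
    static String stripGreedy(String html) {
        return html.replaceAll("<.*>", "");
    }

    // Negated character class: also matches across line breaks.
    static String stripNegatedClass(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        System.out.println(stripReluctant(html)); // prints "Hello world": opening and closing tags are both removed
        System.out.println(stripGreedy(html));    // prints an empty line: the whole input is one match

        // A tag that is split across two lines.
        String multiLine = "<a\nhref=\"x\">link</a>";
        System.out.println(stripReluctant(multiLine).equals("<a\nhref=\"x\">link")); // prints "true": the opening tag survives
        System.out.println(stripNegatedClass(multiLine));                            // prints "link"
    }
}
```

On this sample the reluctant pattern removes closing tags as well as opening ones. Where it falls short is a tag broken across lines, because `.` does not match a line break unless the `(?s)` flag is set.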
