Jsoup parser removes words with '<' and '>'

Question

I'm using the Jsoup.parse() to remove html tags from a String. But my string has a word like <name> also.

The problem is that Jsoup.parse() removes that too, because the text is wrapped in < and >. I can't just strip every < and > from the text either. How can I do this?

String s1 = Jsoup.parse("<p>Hello World</p>").text();
//s1 is "Hello World". Correct

String s2 = Jsoup.parse("<name>").text();
//s2 is "". But it should be <name> because <name> is not a html tag

Is there any way to parse HTML with only selected tags? Like
,,,,etc — Ravindu, Aug 03 '16 at 06:38

score -1 · Answer 1 · edited May 23 '17 at 12:14

I'm using the Jsoup.parse() to remove html tags from a String.

You want to use the Jsoup#clean method. You'll also need a little manual work after because Jsoup will still see <name> as an HTML tag.

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // newer jsoup releases rename this class to Safelist

// Define the list of words to preserve...
String[] myExceptions = { "name" };

// Build a whitelist for Jsoup...
Whitelist myWhiteList = Whitelist.simpleText().addTags(myExceptions);

// Let Jsoup remove any other html tags...
String s2 = Jsoup.clean("<name>", myWhiteList);

// Jsoup still treats each preserved word as an element and appends a closing tag,
// so strip the closing tags it added (the enclosed text is left untouched)...
for (String word : myExceptions) {
    s2 = s2.replace("</" + word + ">", "");
}

System.out.println(">>" + s2);

OUTPUT

>><name>

answered Aug 03 '16 at 10:00

Stephan

This won't work because Jsoup adds the trailing </name> after the text node. Consider this:
<name>Hello World
...you end up with <name>Hello World</name> – Zack Aug 03 '16 at 17:39
Best thing to do is to escape the special characters. – Zack Aug 03 '16 at 17:40
@ZackTeater Thanks for signaling the issue. I have corrected it. – Stephan Aug 03 '16 at 19:29
Yes, but still using regex on HTML content is not a practical solution. Ideally the whitelist data should be encoded prior to parsing. http://stackoverflow.com/a/1732454/1176178 – Zack Aug 04 '16 at 02:23
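
The "escape before parsing" idea from the comments could look roughly like the sketch below. It is only an illustration, not code from the thread: the class name TagEscaper, the method escapeUnknownTags, and the HTML_TAGS list are all hypothetical, and the list of "real" tags is an assumption you would have to fill in yourself. It only rewrites tag-like tokens whose name is not in that list; it does not attempt to parse HTML.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagEscaper {
    // Assumption: the tag names that should be treated as real HTML. Extend as needed.
    private static final Set<String> HTML_TAGS =
            Set.of("p", "b", "i", "em", "strong", "u", "br", "div", "span", "a");

    // Matches a tag-like token: optional slash, a name, then anything up to '>'.
    private static final Pattern TAG_LIKE =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>");

    // Escapes the angle brackets of every tag-like token whose name is not in HTML_TAGS.
    static String escapeUnknownTags(String input) {
        Matcher m = TAG_LIKE.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            String replacement = HTML_TAGS.contains(name)
                    ? m.group()                                            // real tag: keep as-is
                    : "&lt;" + m.group(1) + m.group(2) + m.group(3) + "&gt;"; // plain word: escape
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <p>Hello &lt;name&gt;</p>
        System.out.println(escapeUnknownTags("<p>Hello <name></p>"));
    }
}
```

Passing the escaped string to Jsoup.parse(...).text() should then drop the <p> tags and hand back the word with its angle brackets intact, since jsoup decodes entities when it extracts text. This needs Java 9 or later for Set.of and the StringBuilder overload of appendReplacement.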
