
I'm using Jsoup.parse() to remove HTML tags from a String. But my string also contains a word like <name>.

The problem is that Jsoup.parse() removes that too, because the text contains < and >. I can't simply strip < and > from the text either. How can I do this?

String s1 = Jsoup.parse("<p>Hello World</p>").text();
//s1 is "Hello World". Correct

String s2 = Jsoup.parse("<name>").text();
//s2 is "". But it should be <name> because <name> is not an HTML tag
Ravindu

1 Answer


I'm using the Jsoup.parse() to remove html tags from a String.

You want to use the Jsoup#clean method. You'll also need a little manual work afterwards, because Jsoup will still see <name> as an HTML tag.

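// Requires jsoup on the classpath (newer releases rename Whitelist to Safelist)
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Define the list of words to preserve...
String[] myExceptions = new String[] { "name" };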
int nbExceptions = myExceptions.length;

// Build a whitelist for Jsoup...
Whitelist myWhiteList = Whitelist.simpleText().addTags(myExceptions);

// Let Jsoup remove any html tags...
String s2 = Jsoup.clean("<name>", myWhiteList);

// Complete the initial html tags removal...
for (int i = 0; i < nbExceptions; i++) {
    s2 = s2.replaceAll("<" + myExceptions[i] + ">.+?</" + myExceptions[i] + ">", "<" + myExceptions[i] + ">");
}

System.out.println(">>" + s2);

OUTPUT

>><name>

Stephan
  • This won't work because Jsoup adds the trailing closing tag after the text node. Consider this:

    Hello World

    ....you end up with Hello World
    – Zack Aug 03 '16 at 17:39
  • Best thing to do is to escape the special characters. – Zack Aug 03 '16 at 17:40
  • @ZackTeater Thanks for signaling the issue. I have corrected it. – Stephan Aug 03 '16 at 19:29
  • Yes, but still using regex on HTML content is not a practical solution. Ideally the whitelist data should be encoded prior to parsing. http://stackoverflow.com/a/1732454/1176178 – Zack Aug 04 '16 at 02:23
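
The escaping approach suggested in the comments could look roughly like the sketch below. It is only an illustration: the class PlaceholderEscaper and the method escapePlaceholders are hypothetical names, not part of jsoup, and it assumes the words to preserve are known in advance and appear exactly as <word>. The idea is to turn those words into HTML entities before the string reaches the parser.

```java
import java.util.List;

// Hypothetical helper (not part of jsoup): escapes known "<word>" placeholders
// so an HTML parser treats them as text instead of tags.
class PlaceholderEscaper {

    // Replaces each literal "<word>" with "&lt;word&gt;" for every word in the list.
    static String escapePlaceholders(String input, List<String> placeholders) {
        String result = input;
        for (String word : placeholders) {
            String literal = "<" + word + ">";
            String escaped = "&lt;" + word + "&gt;";
            // String.replace performs a literal (non-regex) replacement
            result = result.replace(literal, escaped);
        }
        return result;
    }

    public static void main(String[] args) {
        String raw = "<p>Hello <name></p>";
        String escaped = escapePlaceholders(raw, List.of("name"));
        System.out.println(escaped); // prints <p>Hello &lt;name&gt;</p>
    }
}
```

The escaped string can then be passed to Jsoup.parse(escaped).text(). Since text() returns decoded text, the entities should come back as a literal <name> while the real tags are removed, with no regex applied to the HTML afterwards.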