I found this website with escape codes, and I'm wondering whether someone has already done this so I don't have to spend a couple of hours building this logic myself:

 StringBuffer sb = new StringBuffer();
 int n = s.length();
 for (int i = 0; i < n; i++) {
     char c = s.charAt(i);
     switch (c) {
         case '\u25CF': sb.append("&#9679;"); break;
         case '\u25BA': sb.append("&#9658;"); break;

         /*
         ... the rest of the hex chars literals to HTML entities
         */  

         default:  sb.append(c); break;
     }
 }
MatBanik
  • see this post...http://stackoverflow.com/questions/994331/java-how-to-decode-html-character-entities-in-java-like-httputility-htmldecode – eat_a_lemon Mar 26 '11 at 06:42
  • Do you want the exact same value, or do you need to have some values converted to something else? – Thorbjørn Ravn Andersen Mar 26 '11 at 08:27
  • See also: http://stackoverflow.com/questions/1273986/converting-utf-8-to-iso-8859-1-in-java – McDowell Mar 26 '11 at 11:48
  • @Mat Banik - re: the results; you sure you don't have a transcoding error at the compilation stage? See here: http://illegalargumentexception.blogspot.com/2009/05/java-rough-guide-to-character-encoding.html#javaencoding_sourcefiles – McDowell Mar 26 '11 at 15:36

3 Answers


These "codes" are merely the decimal representation of the Unicode value of the actual character. Something like this should work, unless you want to be very strict about which characters get converted and which don't:

StringBuilder sb = new StringBuilder();
 int n = s.length();
 for (int i = 0; i < n; i++) {
     char c = s.charAt(i);
     if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append((int)c);
        sb.append(';');
     } else {
        sb.append(c);
     }

 }
Pawel Veselov
  • You should take care of surrogate pairs, too. (Which means iterating over code points, not code units.) – Paŭlo Ebermann Mar 26 '11 at 11:46
  • As Paŭlo mentioned, this code is broken for surrogate pairs (e.g. emojis). See [my answer](http://stackoverflow.com/a/37040891/305973) for handling them correctly. – robinst May 05 '16 at 01:38

The other answers don't work correctly for surrogate pairs, e.g. if you have emojis such as "😀" (see character info). Here's how to do it in Java 8:

StringBuilder sb = new StringBuilder();
s.codePoints().forEach(codePoint -> {
    if (Character.UnicodeBlock.of(codePoint) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append(codePoint);
        sb.append(';');
    } else {
        sb.appendCodePoint(codePoint);
    }
});

And for older Java:

StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); ) {
    int c = s.codePointAt(i);
    if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append(c);
        sb.append(';');
    } else {
        sb.appendCodePoint(c);
    }
    i += Character.charCount(c);
}

A simple way to test whether a solution handles surrogate pairs correctly is to use "\uD83D\uDE00" (😀, U+1F600) as the input. If the output is "&#55357;&#56832;", the solution is wrong. The correct output is "&#128512;".
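
The check described above can be wrapped in a small self-contained program. This is only a sketch: the class name `NumericEntityEscaper` and the method `escape` are made up for illustration and are not part of any library. The loop is the "older Java" variant from this answer.

```java
final class NumericEntityEscaper {

    // Hypothetical helper: replaces every code point outside the Basic Latin
    // block with a decimal numeric character reference.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // Read a whole code point, so a surrogate pair is seen as one value.
            int cp = s.codePointAt(i);
            if (Character.UnicodeBlock.of(cp) != Character.UnicodeBlock.BASIC_LATIN) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            // Advance by 1 or 2 chars depending on the code point.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The surrogate-pair test from the text: expected &#128512;
        System.out.println(escape("\uD83D\uDE00"));
        // One of the characters from the question: expected a &#9679; b
        System.out.println(escape("a \u25CF b"));
    }
}
```

If the first line printed were "&#55357;&#56832;" instead, the loop would be iterating over code units rather than code points.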

robinst

Hmm, what if you did something like this instead:

if (c > 127) {
    sb.append("&#" + (int) c + ";");
} else {
    sb.append(c);
}

Then you just need to decide which range of characters you want HTML-escaped. Here I escape every character outside the ASCII range, i.e. anything above 127.
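
The same range check can also be applied per code point instead of per `char`, which avoids splitting surrogate pairs. A minimal sketch, assuming Java 8 or later; `RangeEscaper`, `escapeAbove` and the `limit` parameter are illustrative names, not an existing API.

```java
final class RangeEscaper {

    // Hypothetical helper: escapes every code point greater than `limit`
    // as a decimal numeric character reference and copies the rest unchanged.
    static String escapeAbove(String s, int limit) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp > limit) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // 127 is the last ASCII code point, matching the check above.
        System.out.println(escapeAbove("caf\u00E9", 127));
    }
}
```

Passing a different `limit` (for example 255) changes which characters are left as-is, so the range decision stays in one place.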

WhiteFang34