ASCII to HTML-Entities Escaping in Java

Question

I found this website with escape codes, and I'm wondering whether someone has already done this so I don't have to spend a couple of hours building this logic:

 StringBuffer sb = new StringBuffer();
 int n = s.length();
 for (int i = 0; i < n; i++) {
     char c = s.charAt(i);
     switch (c) {
         case '\u25CF': sb.append("&#9679;"); break;
         case '\u25BA': sb.append("&#9658;"); break;

         /*
         ... the rest of the hex chars literals to HTML entities
         */  

         default:  sb.append(c); break;
     }
 }

see this post...http://stackoverflow.com/questions/994331/java-how-to-decode-html-character-entities-in-java-like-httputility-htmldecode — eat_a_lemon, Mar 26 '11 at 06:42
Do you want the exact same value, or do you need to have some values converted to something else? — Thorbjørn Ravn Andersen, Mar 26 '11 at 08:27
See also: http://stackoverflow.com/questions/1273986/converting-utf-8-to-iso-8859-1-in-java — McDowell, Mar 26 '11 at 11:48
@Mat Banik - re: the results; you sure you don't have a transcoding error at the compilation stage? See here: http://illegalargumentexception.blogspot.com/2009/05/java-rough-guide-to-character-encoding.html#javaencoding_sourcefiles — McDowell, Mar 26 '11 at 15:36

Pawel Veselov · Accepted Answer · 2011-03-27T04:09:35.770

3

These "codes" are merely the decimal representation of the Unicode value of the actual character. It seems to me that something like this would work, unless you want to be very strict about which codes get converted and which don't.

StringBuilder sb = new StringBuilder();
int n = s.length();
for (int i = 0; i < n; i++) {
    char c = s.charAt(i);
    if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append((int) c);
        sb.append(';');
    } else {
        sb.append(c);
    }
}
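To check the claim that the codes are just decimal Unicode values, the loop can be wrapped in a method and run against the two characters from the question. This is only a sketch: the class name `CharEscaper` and the method name `escapeNonLatin` are made up for illustration and do not come from the answer. It iterates over `char` code units, so it only behaves as expected for characters in the Basic Multilingual Plane.

```java
class CharEscaper {
    // Hypothetical helper (name is an assumption) wrapping the char-based loop.
    static String escapeNonLatin(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
                // Append the decimal value of the UTF-16 code unit as a numeric entity.
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+25CF is 9679 in decimal and U+25BA is 9658, matching the question's switch cases.
        System.out.println(escapeNonLatin("a\u25CFb\u25BA")); // prints a&#9679;b&#9658;
    }
}
```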

edited Mar 27 '11 at 04:09

answered Mar 26 '11 at 07:03


2

You should take care of surrogate pairs, too. (Which means iterating over code points, not code units.) – Paŭlo Ebermann Mar 26 '11 at 11:46
1

As Paŭlo mentioned, this code is broken for surrogate pairs (e.g. emojis). See [my answer](http://stackoverflow.com/a/37040891/305973) for handling them correctly. – robinst May 05 '16 at 01:38

score 2 · Answer 2 · answered May 05 '16 at 01:36

The other answers don't work correctly for surrogate pairs, e.g. if you have emojis such as "😀" (U+1F600). Here's how to do it in Java 8:

StringBuilder sb = new StringBuilder();
s.codePoints().forEach(codePoint -> {
    if (Character.UnicodeBlock.of(codePoint) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append(codePoint);
        sb.append(';');
    } else {
        sb.appendCodePoint(codePoint);
    }
});

And for older Java:

StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); ) {
    int c = s.codePointAt(i);
    if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append(c);
        sb.append(';');
    } else {
        sb.appendCodePoint(c);
    }
    i += Character.charCount(c);
}

A simple way to test if a solution handles surrogate pairs correctly is to use "\uD83D\uDE00" (😀) as the input. If the output is "&#55357;&#56832;", then it's wrong. The correct output is "&#128512;".
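The check described above can be run as a small program. This is a sketch, not part of the original answer: the class name `SurrogatePairCheck` and both method names are assumptions chosen for illustration. It puts a code-unit loop next to a code-point loop so the difference in output is visible.

```java
class SurrogatePairCheck {
    // Code-point based escaping (hypothetical method name), following the "older Java" loop.
    static String escapeByCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.UnicodeBlock.of(cp) != Character.UnicodeBlock.BASIC_LATIN) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return sb.toString();
    }

    // Code-unit based escaping (hypothetical method name), shown only for comparison.
    static String escapeByChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "\uD83D\uDE00"; // U+1F600 encoded as a surrogate pair
        System.out.println(escapeByCodePoint(input)); // prints &#128512;
        System.out.println(escapeByChar(input));      // prints &#55357;&#56832; (wrong)
    }
}
```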

WhiteFang34 · Answer 3 · 2011-03-26T11:55:49.690

0

Hmm, what if you did something like this instead:

if (c > 127) {
    sb.append("&#" + (int) c + ";");
} else {
    sb.append(c);
}

Then you just need to determine the range of characters you want HTML-escaped. In this case I just specified any character outside the ASCII range.
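One way to make the choice of range explicit is to pass it in as a predicate. The following is a minimal sketch under stated assumptions: the class `RangeEscaper`, the method `escape`, and the `shouldEscape` parameter are hypothetical names, and the loop walks code points rather than `char` values so that characters outside the Basic Multilingual Plane yield a single entity.

```java
import java.util.function.IntPredicate;

class RangeEscaper {
    // Hypothetical helper: escapes every code point for which shouldEscape returns true.
    static String escape(String s, IntPredicate shouldEscape) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (shouldEscape.test(cp)) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "caf\u00E9 \u25CF"; // é is U+00E9 (233), ● is U+25CF (9679)

        // Everything beyond 7-bit ASCII is escaped.
        System.out.println(escape(input, cp -> cp > 127)); // prints caf&#233; &#9679;

        // Everything beyond Latin-1 is escaped; é stays a literal character.
        System.out.println(escape(input, cp -> cp > 255));
    }
}
```

With `cp > 127` the result matches the snippet above for BMP characters; a different predicate changes which characters are left untouched.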

edited Mar 26 '11 at 11:55

answered Mar 26 '11 at 07:03


Looks like Pawel has a more complete answer. – WhiteFang34 Mar 26 '11 at 07:05
255 is too high for ASCII - it's only 7-bit so you'd want 127. – McDowell Mar 26 '11 at 11:52
