What is the encoding standard behind "&#27721;" (displayed as "汉") in HTML, and is there a package that converts this format?

Question

On an HTML page (MHTML is also supported), writing

&#27721;

in the HTML body, or in any other element that can contain text, displays as:

汉

What is the name of this encoding standard?

And is there a Java package for this encoding?

score 2 · Answer 1 · answered Dec 02 '16 at 01:23

It's an HTML numeric character reference (often called a character entity number). Apache commons-lang has StringEscapeUtils.escapeHtml(String) and unescapeHtml(String), which can handle these entities.

answered Dec 02 '16 at 01:23

Elliott Frisch


I tested escapeHtml4 and escapeHtml3; the output seems to be exactly the same as the input multibyte character. Is this not the right method? – rufushuang Dec 02 '16 at 01:31

score 0 · Answer 2 · answered Dec 02 '16 at 01:37

27721 is the decimal form of hex 0x6C49, the UCS-2 (Unicode) code of the Chinese character 汉. Browsers convert these references to the corresponding characters automatically.

We can also convert such characters in code. Here is an example:

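 // Requires <windows.h>. Converts U+6C49 (汉) to the system ANSI code page.
 const WCHAR *wszUcs2 = L"\x6c49";
 // With cchWideChar = -1 the returned length includes the terminating NUL.
 int len = WideCharToMultiByte(CP_ACP, 0, wszUcs2, -1, NULL, 0, NULL, NULL);
 char *szGBK = new char[len + 1];
 szGBK[len] = '\0';
 WideCharToMultiByte(CP_ACP, 0, wszUcs2, -1, szGBK, len, NULL, NULL);
 MessageBoxA(NULL, szGBK, NULL, MB_OK); // shows '汉' if the ANSI code page (e.g. GBK) can represent it
 delete[] szGBK;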
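
Since the question asks about Java, the relationship between the decimal number and the character can also be checked there. This is only a minimal sketch; the class name CodePointDemo and the helper fromDecimal are made up for illustration, and the printed glyph depends on the console encoding.

```java
public class CodePointDemo {
    // Hypothetical helper: returns the character(s) for a decimal code point.
    static String fromDecimal(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int codePoint = 27721;
        // Decimal 27721 and hex 6c49 are the same number.
        System.out.println(Integer.toHexString(codePoint));
        // The character at that code point (汉).
        System.out.println(fromDecimal(codePoint));
        // Going the other way: character to decimal code point.
        System.out.println("汉".codePointAt(0));
    }
}
```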

score 0 · Accepted Answer · edited May 23 '17 at 11:45

After some searching, I found that this is not simply an HTML entity. More precisely, it should be called an 'HTML entity with US-ASCII encoding'.

HTML entities only address characters that conflict with HTML syntax, such as <, >, ", and &. Multi-byte characters such as 汉 are not required to be encoded. That is why StringEscapeUtils.escapeHtml4 in the Apache commons-lang package takes 汉 as input and returns the same 汉.

I found an answer at

https://stackoverflow.com/a/25228492/3198960

Using some newer Java features, and with the single-quote mark and line breaks added to the escaped set, the code becomes:

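    public static String toHTMLEntity(String s) {
        StringBuilder sb = new StringBuilder();
        // Iterate over code points rather than chars, so that a character
        // outside the BMP becomes one reference, not two surrogate halves.
        s.codePoints().forEach(cp -> {
            if (cp > 127 || cp == '<' || cp == '>' || cp == '\'' || cp == '"'
                    || cp == '&' || cp == '=' || cp == '\n' || cp == '\r') {
                // Non-ASCII and HTML-sensitive characters become &#NNNN;
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }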
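
For the reverse direction, a decoder for decimal references can be sketched with a regular expression. This is an illustration rather than a library API: the class name NumericReferenceDecoder and the method decode are hypothetical, only the decimal form (&#27721;) is handled, and hex forms such as &#x6C49; or named entities such as &amp; are left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericReferenceDecoder {
    // Matches decimal references such as &#27721; (1 to 7 digits).
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    // Hypothetical helper: replaces each decimal reference with its character.
    public static String decode(String s) {
        Matcher m = DECIMAL_REF.matcher(s);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one.
            sb.append(s, last, m.start());
            int cp = Integer.parseInt(m.group(1));
            if (Character.isValidCodePoint(cp)) {
                sb.appendCodePoint(cp);
            } else {
                // Leave out-of-range references as they are.
                sb.append(m.group());
            }
            last = m.end();
        }
        // Copy whatever follows the last match.
        sb.append(s, last, s.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#27721;&#60;b&#62;")); // 汉<b>
    }
}
```

Appending by code point keeps the decoder symmetric with an encoder that emits one reference per code point.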
