On an HTML page (MHTML is also supported), writing
&#27721;
in the HTML body, or in any other element that can contain text, displays as:
汉
What is the name of this encoding standard?
And is there a Java package for this encoding?
It's an HTML character entity number (formally, a numeric character reference), and Apache commons-lang has StringEscapeUtils.escapeHtml(String)
and unescapeHtml(String),
which can handle these entities.
27721 is the decimal form of hex 0x6c49, the UCS-2 code of the Chinese character 汉. The browser converts these references to characters automatically.
We can also convert such characters in code; here is an example using the Windows API:
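const WCHAR *wszUcs2 = L"\x6c49"; // backslash escape, not "/x6c49"
// With cchWideChar == -1 the returned length already includes the terminating null.
int len = WideCharToMultiByte(CP_ACP, 0, wszUcs2, -1, NULL, 0, NULL, NULL);
char *szGBK = new char[len + 1];
szGBK[len] = '\0';
// CP_ACP is the system ANSI code page, so the result is GBK only on a Chinese-locale system.
WideCharToMultiByte(CP_ACP, 0, wszUcs2, -1, szGBK, len, NULL, NULL);
MessageBoxA(NULL, szGBK, NULL, MB_OK); // output '汉'
delete[] szGBK;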
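Since the question asks about Java, the relationship between 汉, 0x6c49 and 27721 can also be checked there. The following is only a minimal sketch using the standard library; the class name CodePointDemo and the helper decodeDecimalReference are hypothetical names made up for illustration, and the helper handles a single decimal reference only, not arbitrary HTML.

```java
public class CodePointDemo {
    // Hypothetical helper: decode one decimal reference such as "&#27721;".
    static String decodeDecimalReference(String ref) {
        if (!ref.startsWith("&#") || !ref.endsWith(";")) {
            throw new IllegalArgumentException("not a decimal reference: " + ref);
        }
        // Strip the leading "&#" and the trailing ";" and parse the digits.
        int codePoint = Integer.parseInt(ref.substring(2, ref.length() - 1));
        // Character.toChars also covers code points above U+FFFF.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int codePoint = "\u6c49".codePointAt(0);            // the character 汉
        System.out.println(codePoint);                      // 27721
        System.out.println(Integer.toHexString(codePoint)); // 6c49
        // Prints the character itself; how it appears depends on the console encoding.
        System.out.println(decodeDecimalReference("&#27721;"));
    }
}
```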
After some searching, I found that this is not simply an HTML entity. More precisely, it should be called 'HTML Entity with US-ASCII encoding'.
HTML entities only address characters that conflict with HTML syntax, such as <, >, ", and &. They do not require multi-byte characters such as 汉 to be encoded. So in the apache-commons-lang package, StringEscapeUtils.escapeHtml4
takes 汉 as input and returns the same 汉 unchanged.
I found an answer at
https://stackoverflow.com/a/25228492/3198960
Using some newer Java features, and with the single-quote mark and line breaks added to the escaped characters, the code becomes:
public static String toHTMLEntity(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
        if (c > 127 || c == '<' || c == '>' || c == '\'' || c == '"' || c == '&' || c == '=' || c == '\n'
                || c == '\r') {
            sb.append("&#").append((int) c).append(';');
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}
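One caveat with the loop above: it walks UTF-16 char values, so a character outside the Basic Multilingual Plane (an emoji, for example) would come out as two separate surrogate numbers rather than one reference. A possible variant that iterates by code point is sketched below. It is not taken from the linked answer; the class name HtmlNumericEscaper and the method name toHtmlEntityByCodePoint are hypothetical, and the set of escaped characters is simply copied from the code above.

```java
public class HtmlNumericEscaper {
    // Hypothetical variant: escape by Unicode code point instead of by UTF-16 char.
    static String toHtmlEntityByCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 127 || cp == '<' || cp == '>' || cp == '\'' || cp == '"'
                    || cp == '&' || cp == '=' || cp == '\n' || cp == '\r') {
                // Emit a decimal numeric character reference for this code point.
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            // Advance by 1 char for BMP characters, 2 for a surrogate pair.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtmlEntityByCodePoint("\u6c49"));       // &#27721;
        System.out.println(toHtmlEntityByCodePoint("a<b"));          // a&#60;b
        System.out.println(toHtmlEntityByCodePoint("\uD83D\uDE00")); // &#128512;
    }
}
```

For strings that contain only BMP characters, this should produce the same output as the char-based loop.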