Display Chinese characters WITHOUT using utf8 encoding?

Question

I'm fetching rows from a MySQL database with a unicode_general_ci collation. Some columns contain Chinese characters such as 格拉巴酒和蒸馏物, and I need to display those characters.

I know that I should work in utf-8 encoding:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

but I can't: I'm working on a legacy application where most of the .php files are saved as ANSI and the whole site is using:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

Is there any way to display them?

Bonus question: I've tried to manually change the encoding in Chrome (Tools -> Encoding -> UTF-8) and it doesn't seem to work: the page is reloaded, but ???? is displayed instead of the Chinese characters.

score 3 · Answer 1 · edited May 23 '17 at 11:44

You can display 格 using the numeric entity reference &#x683C;, etc. The encoding of the page should not matter in this case; HTML entity references always refer to Unicode code points.

PHP has a function htmlentities for this purpose, but it appears that you will need workarounds for handling numeric entities. This json_encode hack is fairly obscure, but is probably programmatically the simplest.

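// json_encode() escapes non-ASCII characters as \uXXXX (given UTF-8 input);
// strip the surrounding quotes, then turn each \uXXXX into &#xXXXX;.
echo preg_replace('/\\\\u([0-9a-f]{4})/', '&#x$1;',
     preg_replace('/^"(.*)"$/', '$1', json_encode($s)));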

This leverages the fact that json_encode will coincidentally do the conversion for you; the rest is all mechanics. (I guess that's PHP for you.)

IDEone demo
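As an alternative to the json_encode trick above, here is a minimal sketch using mb_encode_numericentity. It assumes PHP 7 or later with the mbstring extension loaded, and that the string really arrives as UTF-8 (for example because the connection charset was set with mysqli::set_charset). The helper name to_numeric_entities is made up for illustration.

```php
<?php
// Hypothetical helper: turns every non-ASCII character of a UTF-8 string
// into a decimal numeric character reference such as &#26684;.
// Assumes the mbstring extension is available and $utf8 is valid UTF-8.
function to_numeric_entities(string $utf8): string
{
    // Map format: [first code point, last code point, offset, mask].
    // 0x80..0x10FFFF covers everything outside plain ASCII.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// "\u{683C}\u{62C9}" is 格拉, written with escapes so this source file
// can stay ASCII-only (useful when the .php files are saved as "ANSI").
echo to_numeric_entities("\u{683C}\u{62C9}"), "\n";
```

ASCII characters, including < and &, pass through untouched, so the string would still need htmlspecialchars($s, ENT_QUOTES, 'UTF-8') applied first if it can contain markup.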

Your "bonus question" isn't really a question, but of course, that's how it works; raw bytes in the range 128-255 are only rarely valid UTF-8 sequences, so unless what you have on the page is valid UTF-8, you are likely to get the "invalid character" replacement glyph for those bytes.
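To illustrate why high bytes from a single-byte encoding usually fail as UTF-8, here is a small sketch (assuming PHP with the mbstring extension; the sample string is made up). A lone 0xE9 byte, which is é in ISO-8859-1, is the start of a three-byte UTF-8 sequence with nothing after it, so it is rejected.

```php
<?php
// "caf\xE9" is "café" encoded as ISO-8859-1: one byte per character.
$latin1 = "caf\xE9";

// The trailing 0xE9 is an incomplete multi-byte lead byte in UTF-8.
$validAsUtf8 = mb_check_encoding($latin1, 'UTF-8');

// Converting properly yields the two-byte UTF-8 form of é (0xC3 0xA9).
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($validAsUtf8);                       // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));  // bool(true)
echo bin2hex($utf8), "\n";                    // 636166c3a9
```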

For the record, the first two Chinese Han glyphs in your text in UTF-8 would display as æ ¼æ‹‰ if mistakenly displayed in Windows code page 1252 (what you, and oftentimes Microsoft, carelessly refer to as "ANSI") -- if you have those bytes on the page then forcing the browser to display it in UTF-8 should actually work as a workaround as well.
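The mojibake described above can be reproduced with a short sketch, assuming PHP 7 or later with mbstring. It reinterprets the UTF-8 bytes of 格拉 as Windows-1252 and then reverses the mistake; the variable names are only illustrative.

```php
<?php
// 格拉 as UTF-8, written with escapes so the file itself stays ASCII.
$utf8 = "\u{683C}\u{62C9}";
echo bin2hex($utf8), "\n"; // e6a0bce68b89

// Pretend those six bytes are Windows-1252 and re-encode them as UTF-8.
// This is what a browser effectively does when the charset is wrong.
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

// Going the other way recovers the original bytes, which is why forcing
// the browser to UTF-8 works when the page bytes were UTF-8 all along.
$restored = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');

var_dump($restored === $utf8); // bool(true)
```

The second character of $mojibake is a no-break space (byte 0xA0), which is why the garbled text appears to contain a gap.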

For additional background I recommend @deceze's What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to Work With Text.

http://stackoverflow.com/questions/7106470/utf-8-to-unicode-code-points has some ideas but they all seem to have flaws. — tripleee, Aug 26 '14 at 20:31

score 1 · Accepted Answer · answered Aug 26 '14 at 19:50

I'm not sure that you can. iso-8859-1 is commonly called "Latin 1". There's no support for any Asian kanji-type languages at all.

http://en.wikipedia.org/wiki/ISO/IEC_8859-1

ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1," consisting of 191 characters from the Latin script. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa. It is also commonly used in most standard romanizations of East-Asian languages.

Machavity


"Kanji" is the Japanese term. The Chinese term is "hanzi". – tripleee Aug 26 '14 at 20:10
I was afraid of this. I suppose my only chance is to do some trick to only use utf8 in the specific page(s). – gremo Aug 26 '14 at 20:13
@tripleee Hence why I used "kanji-type". Nice to know the proper term tho – Machavity Aug 26 '14 at 22:08
