2

I'm fetching rows from a MySQL database with a unicode_general_ci collation. The columns contain Chinese characters such as 格拉巴酒和蒸馏物, and I need to display those characters.

I know that I should work in utf-8 encoding:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

but I can't: I'm working on a legacy application where most of the .php files are saved as ANSI and the whole site is using:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

Is there any way to display them?

Bonus question: I've tried to manually change the encoding in Chrome (Tools -> Encoding -> UTF-8) and it doesn't seem to work: the page is reloaded, but ???? is displayed instead of the Chinese characters.

gremo

2 Answers

3

You can display using the numeric entity reference &#26684;, etc. The encoding of the page should not matter in this case; HTML entity references always refer to Unicode code points.

PHP has a function htmlentities for this purpose, but it appears that you will need workarounds for handling numeric entities. This json_encode hack is fairly obscure, but is probably programmatically the simplest.

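// json_encode() emits \uXXXX escapes: strip the surrounding quotes, then turn each escape into an entity.
echo preg_replace('/\\\\u([0-9a-f]{4})/', '&#x$1;',
     preg_replace('/^"(.*)"$/', '$1', json_encode($s)));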

This leverages the fact that json_encode will coincidentally do the conversion for you; the rest is all mechanics. (I guess that's PHP for you.)
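If the mbstring extension is available, a less roundabout route might be mb_encode_numericentity. The following is only a sketch under a few assumptions: the helper name to_numeric_entities is made up for illustration, and the string coming from the database is assumed to already be valid UTF-8 (which in turn assumes the MySQL connection charset is set to UTF-8).

```php
<?php
// Hypothetical helper (not part of the original answer): converts every
// non-ASCII character of a UTF-8 string into a decimal numeric entity.
// Assumes the mbstring extension is loaded and $s is valid UTF-8.
function to_numeric_entities($s)
{
    // Map format: [first code point, last code point, offset, mask].
    // 0x80..0x10FFFF covers everything outside plain ASCII.
    $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($s, $map, 'UTF-8');
}

// The two Han characters are written as raw UTF-8 bytes here, so this
// file can itself be saved in any single-byte encoding ("ANSI").
echo to_numeric_entities("Grappa: \xE6\xA0\xBC\xE6\x8B\x89");
// prints: Grappa: &#26684;&#25289;
```

The output is pure ASCII, so it survives being served as iso-8859-1. Note that this does not escape <, > or &; if the text is untrusted, it would still need htmlspecialchars as a separate step.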

IDEone demo

Your "bonus question" isn't really a question, but of course, that's how it works; raw bytes in the range 128-255 are only rarely valid UTF-8 sequences, so unless what you have on the page is valid UTF-8, you are likely to get the "invalid character" replacement glyph for those bytes.
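To illustrate the point about high bytes, here is a small sketch (assuming the mbstring extension is loaded; the function name is_valid_utf8 is only for illustration) that checks whether a byte string is well-formed UTF-8. A lone byte above 127, as found in an ISO-8859-1 page, fails the check, while the three-byte sequence for 格 passes.

```php
<?php
// Hypothetical wrapper around mb_check_encoding(), for illustration only.
function is_valid_utf8($bytes)
{
    return mb_check_encoding($bytes, 'UTF-8');
}

// "é" as the single byte ISO-8859-1 uses for it: not valid UTF-8 on its own.
var_dump(is_valid_utf8("\xE9"));         // bool(false)

// "格" (U+683C) as its three-byte UTF-8 sequence.
var_dump(is_valid_utf8("\xE6\xA0\xBC")); // bool(true)
```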

For the record, the first two Chinese Han glyphs in your text in UTF-8 would display as æ ¼æ‹‰ if mistakenly displayed in Windows code page 1252 (what you, and oftentimes Microsoft, carelessly refer to as "ANSI") -- if you have those bytes on the page then forcing the browser to display it in UTF-8 should actually work as a workaround as well.
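If you want to check which characters the UTF-8 bytes of 格拉 turn into when read as code page 1252, something like the following sketch should do it. It assumes mbstring is available; misread_as_cp1252 is a made-up name. The result is printed as hex so that it does not depend on how your terminal or browser decodes the output.

```php
<?php
// Hypothetical helper: treat the input bytes as Windows-1252 text and
// re-encode the resulting characters as UTF-8. Feeding it UTF-8 bytes
// reproduces what a browser shows when it guesses the wrong charset.
function misread_as_cp1252($utf8Bytes)
{
    return mb_convert_encoding($utf8Bytes, 'UTF-8', 'Windows-1252');
}

// UTF-8 bytes of the two Han characters: E6 A0 BC and E6 8B 89.
$garbled = misread_as_cp1252("\xE6\xA0\xBC\xE6\x8B\x89");

// Six characters: U+00E6, U+00A0, U+00BC, U+00E6, U+2039, U+2030.
echo bin2hex($garbled);
// prints: c3a6c2a0c2bcc3a6e280b9e280b0
```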

For additional background I recommend @deceze's What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to Work With Text.

Community
tripleee
1

I'm not sure that you can. ISO-8859-1 is commonly called "Latin 1", and it has no support for Chinese, Japanese, or Korean characters at all.

http://en.wikipedia.org/wiki/ISO/IEC_8859-1

ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1," consisting of 191 characters from the Latin script. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa. It is also commonly used in most standard romanizations of East-Asian languages.

Machavity