How to normalize encoding names, like ks_c_5601-1987 to CP949?

Question

I am fetching emails from a mail server, converting each message to UTF-8, and saving it in a database. To convert the charset I am using mb_convert_encoding, but it fails to convert gb2312 and ks_c_5601-1987. After some googling I found that I can use CP936 instead of gb2312 and CP949 instead of ks_c_5601-1987.

Going by the above approach, I would have to maintain a separate list of charset mappings in my code. Is there a way to normalize encoding names to the names PHP supports internally, eliminating the need to maintain a local map?

`iconv` recognizes `ks_c_5601-1987` but cannot convert. `mb_convert_encoding` doesn't support `949` or `ks_c_5601-1987` at all. `iconv` recognizes and can convert `gb2312` though. — Esailija, Dec 10 '12 at 13:57
mb_convert_encoding supports CP949 under the name UHC according to http://php.net/manual/en/mbstring.supported-encodings.php — borrible, Dec 10 '12 at 13:58
@borrible funny thing is that the docs say `UHC (CP949)` but they couldn't bother to alias it to CP949 as well :P — Esailija, Dec 10 '12 at 15:39

score 2 · Answer 1 · answered Dec 10 '12 at 14:03

According to the list of supported character encodings, only a small number of encodings are listed explicitly by code page. Given how few of these cases there are, a list of mappings may not be too inappropriate, even though it is not the built-in normalisation requested.

The relevant ones appear to be the following (the lowercase name on the right is the name you'll need to convert from):

CP932 shift_jis
CP51932 euc_jp
CP50220 iso-2022-jp
CP50221 csISO2022JP
CP50222 iso-2022-jp
CP936 gb2312
CP950 big5

The following are also listed by code-page on the PHP documentation but appear to have suitable synonyms already:

CP866 (IBM866)
UHC (CP949)
Windows-1251 (CP1251)
Windows-1252 (CP1252)
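
As a rough illustration of the mapping approach, here is a minimal sketch that assumes the mbstring extension is loaded. The function names `normalize_charset` and `to_utf8` and the contents of the local map are hypothetical; the two map entries come from the question and comments above (gb2312 to CP936, ks_c_5601-1987 to UHC), and anything not in the map falls back to the names and aliases mbstring itself reports.

```php
<?php
// Hypothetical helper: turn a charset label from a mail header into a name
// that mb_convert_encoding() accepts, or null if nothing matches.
function normalize_charset(string $label): ?string
{
    // Assumed local overrides (lowercase label => mbstring name).
    // Extend this with the other pairs from the list above as needed.
    static $map = [
        'gb2312'         => 'CP936',
        'ks_c_5601-1987' => 'UHC', // mbstring's name for CP949
        'cp949'          => 'UHC',
    ];

    // Lazily build a lowercase index of every name and alias mbstring knows.
    static $known = null;
    if ($known === null) {
        $known = [];
        foreach (mb_list_encodings() as $name) {
            $known[strtolower($name)] = $name;
            try {
                $aliases = mb_encoding_aliases($name);
            } catch (\Throwable $e) {
                $aliases = [];
            }
            foreach (is_array($aliases) ? $aliases : [] as $alias) {
                $known[strtolower($alias)] = $name;
            }
        }
    }

    $key = strtolower(trim($label));
    // Local overrides win; otherwise use whatever mbstring recognises.
    return $map[$key] ?? $known[$key] ?? null;
}

// Hypothetical wrapper: convert a message body to UTF-8, or return null
// when the declared charset cannot be resolved.
function to_utf8(string $body, string $label): ?string
{
    $from = normalize_charset($label);
    return $from === null ? null : mb_convert_encoding($body, 'UTF-8', $from);
}
```

Returning null for an unknown label leaves the caller to decide what to do with it (log it, try `iconv`, or store the raw bytes) rather than passing an unsupported name to `mb_convert_encoding`.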
