
I am fetching emails from a mail server, converting each message to UTF-8, and saving it in a database. To convert the charset I am using `mb_convert_encoding`, but it fails to convert `gb2312` and `ks_c_5601-1987`. After some googling I found that I can use `CP936` instead of `gb2312` and `CP949` instead of `ks_c_5601-1987`.

With that approach I would have to maintain a separate list of charset mappings in my code. Is there a way to normalize encoding names to the names PHP supports internally, so that I don't need to maintain a local map?

borrible
Nidhi Kaushal
  • `iconv` recognizes `ks_c_5601-1987` but cannot convert. `mb_convert_encoding` doesn't support `949` or `ks_c_5601-1987` at all. `iconv` recognizes and can convert `gb2312` though. – Esailija Dec 10 '12 at 13:57
  • mb_convert_encoding supports CP949 under the name UHC according to http://php.net/manual/en/mbstring.supported-encodings.php – borrible Dec 10 '12 at 13:58
  • @borrible funny thing is that the docs say `UHC (CP949)` but they couldn't bother to alias it to CP949 as well :P – Esailija Dec 10 '12 at 15:39

1 Answer


According to the list of supported character encodings, only a small number of encodings are listed explicitly by code page. This is not the built-in normalisation you asked for, but because there are so few of these cases, keeping a list of mappings may not be too inappropriate.

The relevant ones appear to be the following (the name on the left is the one PHP supports; the name on the right is the charset name you'll need to map from):

  • CP932 shift_jis
  • CP51932 euc_jp
  • CP50220 iso-2022-jp
  • CP50221 csISO2022JP
  • CP50222 iso-2022-jp
  • CP936 gb2312
  • CP950 big5

The following are also listed by code page in the PHP documentation, but appear to have suitable synonyms already:

  • CP866 (IBM866)
  • UHC (CP949)
  • Windows-1251 (CP1251)
  • Windows-1252 (CP1252)
borrible