- Do you mean that you don't know which CJK encoding the incoming message is in?
The canonical place to find that information is the charset= parameter of the Content-Type: header.
Unfortunately, extracting that is not as straightforward as you would hope. You'd think the object returned by imap_header would contain the type information, but it doesn't. Instead, you have to use imap_fetchheader to grab the raw headers from the message and parse them yourself.
Parsing RFC822 headers isn't completely straightforward either, because a parameter value may be quoted or folded onto a continuation line. For simple cases you might be able to get away with matching each line against ^content-type:.*; *charset=([^;]+) (case-insensitively). To do it properly, though, you'd have to run the whole message, headers and body, through a proper RFC822-family parser like MailParse.
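To illustrate the difference between the two approaches, here is a minimal sketch in Python rather than PHP, chosen only because its standard library ships a header parser. The function names are made up for this example, and raw_headers stands for whatever text imap_fetchheader gave you. The first function applies the regex above line by line; the second hands the headers to a real parser.

```python
import re
from email import message_from_string

# The per-line pattern from the text, compiled case-insensitively.
CHARSET_RE = re.compile(r'^content-type:.*; *charset=([^;]+)', re.IGNORECASE)

def charset_by_regex(raw_headers):
    """Quick-and-dirty lookup: match each header line on its own."""
    for line in raw_headers.splitlines():
        m = CHARSET_RE.match(line)
        if m:
            # Drop surrounding whitespace and optional double quotes.
            return m.group(1).strip().strip('"').lower()
    return None  # not found, or the parameter sits on a folded line

def charset_by_parser(raw_headers):
    """Let a real RFC822-family parser deal with folding and quoting."""
    return message_from_string(raw_headers).get_content_charset()

simple = 'From: a@example.com\nContent-Type: text/plain; charset="EUC-JP"\n\n'
folded = 'Content-Type: text/plain;\n charset=gb2312\n\n'
```

With these inputs the regex finds the charset in simple but misses it in folded, where the parameter is on a continuation line; the parser handles both, which is the reason for preferring a proper parser when one is available.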
And then you've still got the problem of messages that neglect to include charset information. For that case you would need to fall back to guessing with mb_detect_encoding.
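The idea behind that kind of guessing can be sketched as "try each candidate encoding strictly and keep the first one that decodes cleanly". The Python below is only an illustration of the principle, not a description of how mb_detect_encoding works internally; the candidate list and its order are assumptions you would tune to the mail you expect. Order matters, because many byte sequences are valid in more than one CJK encoding.

```python
# Hypothetical candidate list, most restrictive encodings first.
CANDIDATES = ("ascii", "utf-8", "euc_jp", "shift_jis", "gb2312", "big5", "euc_kr")

def guess_encoding(data, candidates=CANDIDATES):
    """Return the first candidate that decodes `data` without errors."""
    for name in candidates:
        try:
            data.decode(name)  # strict decoding: any invalid byte raises
        except UnicodeDecodeError:
            continue
        return name
    return None  # nothing fit; the caller has to pick a default
```

For example, guess_encoding(b"hello") stops at "ascii", while bytes produced by "日本語".encode("euc_jp") are rejected as ASCII and UTF-8 and accepted as EUC-JP. Treat the result as a hint rather than a fact, since a short message can decode cleanly under the wrong encoding.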
- Or are you just worried about which language the correctly-decoded characters represent?
In this case the header you want to read, using the same method as above, is Content-Language. However, it is very often not present, in which case you have to fall back to guessing again. CJK (Han) unification means that Chinese, Japanese and Korean share many of the same code points, but there are a few heuristics you can use to guess:
The encoding that the message was in, from the above. E.g. if it was EUC-CN, chances are the language is going to be simplified Chinese.
The presence of any kana (U+3040–U+30FF -> Japanese) or Hangul (U+AC00–U+D7FF -> Korean) in the text.
The presence of simplified vs traditional Chinese characters. Although some characters can represent either, others (where there is a significant change to the strokes between the two variants) only fit one. The simple way to detect their presence is to attempt to encode the string to GB2312 and to Big5 and see which one fails. Use GB2312 (EUC-CN) for this rather than GBK, because GBK also covers most traditional characters and so would rarely fail. I.e. if you can't encode to GB2312 but you can to Big5, it'll be traditional Chinese.
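The last two heuristics can be sketched as follows, again in Python purely for illustration. The function names and the returned language tags are made up for this example. Note that the sketch tests against GB2312 rather than GBK: Python's gbk codec also encodes most traditional characters, so it would not discriminate between the two.

```python
def has_char_in(text, lo, hi):
    """True if any character of `text` falls in the code point range lo..hi."""
    return any(lo <= ord(ch) <= hi for ch in text)

def can_encode(text, codec):
    """True if every character of `text` is representable in `codec`."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True

def guess_language(text):
    """Best-effort guess for already-decoded text; None means undecided."""
    if has_char_in(text, 0x3040, 0x30FF):   # hiragana or katakana
        return "ja"
    if has_char_in(text, 0xAC00, 0xD7FF):   # Hangul
        return "ko"
    simplified_ok = can_encode(text, "gb2312")
    traditional_ok = can_encode(text, "big5")
    if simplified_ok and not traditional_ok:
        return "zh-Hans"
    if traditional_ok and not simplified_ok:
        return "zh-Hant"
    return None  # fits both or neither: fall back to the other hints
```

A string such as "中文" encodes in both GB2312 and Big5, so it comes back as None; that is where the first heuristic (the encoding the message arrived in) or a Content-Language header has to break the tie.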