PHP chinese character IMAP

Question

I retrieve data from an email through IMAP and I want to detect programmatically (via PHP) whether the body contains Chinese, Japanese, or Korean characters. I know how to convert the encoding, but not how to detect it.

    $mbox = imap_open("{localhost:995/pop3/ssl/novalidate-cert}", "info@***.com", "********");

    $email = $_REQUEST['email'];

    // find the messages sent from that address
    $num_mensaje = imap_search($mbox, "FROM $email");

    // grab the body of the first matching message
    $body = imap_fetchbody($mbox, $num_mensaje[0], "1");

    // Chinese, for example
    $str = mb_convert_encoding($body, "UTF-8", "EUC-CN");

    imap_close($mbox);

Any ideas?

Related: [Detect chinese (multibyte) character in the string](http://stackoverflow.com/q/1550950) (not the accepted answer, the other one) and [What's the complete range for Chinese characters in Unicode?](http://stackoverflow.com/q/1366068) — Pekka, Nov 06 '11 at 10:59
I tried this: echo (preg_match('/^[\x{4E00}-\x{9FA5}]*$/u', $body) ? "Found" : "Not Found"); — josiland, Nov 06 '11 at 11:15
echo (preg_match('/^[\x{4E00}-\x{9FA5}]*$/u', $str) ? "Found" : "Not Found"); — josiland, Nov 06 '11 at 11:17
What is your actual use case here? Why do you need to convert character set data? — Pekka, Nov 06 '11 at 11:38
That doesn't answer the question. What are you trying to do? What encoding do you need the data to be in the end? What is your current problem? — Pekka, Nov 06 '11 at 13:10

bobince · Answer 1 · 2011-11-06T11:58:26.400

Do you mean that you don't know which CJK encoding the incoming message is in?

The canonical place to find that information is the charset= parameter in the Content-Type: header.

Unfortunately, extracting that is not as straightforward as you would hope. You'd think that the object returned by imap_header would contain the type information, but it doesn't. Instead, you have to use imap_fetchheader to grab the raw headers from the message and parse them yourself.

Parsing RFC822 headers isn't completely straightforward. For simple cases you might be able to get away with matching each line against ^content-type:.*; *charset=([^;]+) (case-insensitively). But to do it really properly, you'd have to run the whole message headers and body through a proper RFC822-family parser like MailParse.
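
For the simple case, the regular-expression approach could be sketched roughly as follows. The function name charset_from_headers is hypothetical, and $rawHeaders is assumed to be the string returned by imap_fetchheader. Folded header lines and quoted values are handled only in the most basic way, so this is not a substitute for a real parser.

```php
<?php
// Hypothetical helper: pull the charset parameter out of raw RFC822 headers.
// $rawHeaders is assumed to be what imap_fetchheader($mbox, $msgno) returns.
function charset_from_headers(string $rawHeaders): ?string
{
    // Unfold continuation lines (a line break followed by a space or tab).
    $unfolded = preg_replace('/\r?\n[ \t]+/', ' ', $rawHeaders);

    // Case-insensitive, per-line match on the Content-Type header.
    if (preg_match('/^content-type:.*;\s*charset=([^;\r\n]+)/mi', $unfolded, $m)) {
        // Strip optional quotes and surrounding whitespace, normalise the case.
        return strtoupper(trim($m[1], " \t\"'"));
    }

    return null; // no charset declared in these headers
}

$headers = "Subject: test\r\nContent-Type: text/plain;\r\n\tcharset=\"gb2312\"\r\n";
var_dump(charset_from_headers($headers));
```

For a multipart message the top-level Content-Type usually carries only a boundary, so the same match would have to be run against the headers of the individual part.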

And then you've still got the problem of messages that neglect to include charset information. For that case you would need to use mb_detect_encoding.
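
A minimal sketch of that fallback, assuming the mbstring extension is loaded. The helper name guess_cjk_encoding and the candidate list are illustrative assumptions, not something from the question; the list should be adjusted to the mail you actually expect.

```php
<?php
// Hypothetical helper: guess the encoding of a body that declares no charset.
function guess_cjk_encoding(string $body): ?string
{
    // Assumed candidate list; it is not exhaustive.
    $candidates = ['UTF-8', 'EUC-JP', 'SJIS', 'EUC-KR', 'EUC-CN', 'BIG-5'];

    // Strict mode (third argument) rejects encodings in which the bytes are invalid.
    $guess = mb_detect_encoding($body, $candidates, true);

    return $guess === false ? null : $guess;
}

// "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E" is the UTF-8 byte sequence for 日本語.
echo guess_cjk_encoding("\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E") ?? 'unknown', "\n";
```

The legacy double-byte encodings overlap heavily, so the same bytes are often valid in several of them. The result should be treated as a guess, not as a fact about the message.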

Or are you just worried about which language the correctly-decoded characters represent?

In this case the header you want to read, using the same method as above, is Content-Language. However, it is very often absent, in which case you have to fall back to guessing again. CJK unification means that all of these languages may use many of the same characters, but there are a few heuristics you can use to guess:

1. The encoding that the message was in, from the above. E.g. if it was EUC-CN, chances are the language is going to be simplified Chinese.
2. The presence of any kana (U+3040–U+30FF -> Japanese) or Hangul (U+AC00–U+D7FF -> Korean) in the text.
3. The presence of simplified vs traditional Chinese characters. Although some characters can represent either, others (where there is a significant change to the strokes between the two variants) only fit one. The simple way to detect their presence is to attempt to encode the string to GBK and Big5 encodings and see if it fails. I.e. if you can't encode to GBK but you can to Big5, it'll be traditional Chinese.
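
The kana/Hangul check and the encode-and-see-if-it-fails check could be sketched as below, assuming the text has already been decoded to UTF-8 and mbstring is loaded. Both function names and the return labels are hypothetical. The sketch also swaps EUC-CN (GB2312) in for GBK on the simplified side, because GBK additionally covers most traditional characters; that substitution is this example's own choice, not part of the answer. Failure is detected by a round trip, since mb_convert_encoding substitutes a placeholder for unrepresentable characters instead of returning an error.

```php
<?php
// Hypothetical helper: true if $utf8 survives a round trip through $encoding,
// i.e. every character in it can be represented in that encoding.
function fits_encoding(string $utf8, string $encoding): bool
{
    $encoded = mb_convert_encoding($utf8, $encoding, 'UTF-8');
    return mb_convert_encoding($encoded, 'UTF-8', $encoding) === $utf8;
}

// Hypothetical helper: guess the language of an already-decoded UTF-8 string.
// The return labels are this sketch's own choice.
function guess_cjk_language(string $utf8): string
{
    // Kana suggests Japanese, Hangul suggests Korean.
    if (preg_match('/[\x{3040}-\x{30FF}]/u', $utf8)) {
        return 'ja';
    }
    if (preg_match('/[\x{AC00}-\x{D7FF}]/u', $utf8)) {
        return 'ko';
    }
    if (!preg_match('/\p{Han}/u', $utf8)) {
        return 'unknown'; // no Han characters at all
    }

    // Which legacy Chinese encoding can hold the text?
    // Assumption: EUC-CN (GB2312) stands in for GBK here.
    $simplified  = fits_encoding($utf8, 'EUC-CN');
    $traditional = fits_encoding($utf8, 'BIG-5');

    if ($simplified && !$traditional) {
        return 'zh-Hans';
    }
    if ($traditional && !$simplified) {
        return 'zh-Hant';
    }
    return 'cjk'; // ambiguous: fits both encodings or neither
}

echo guess_cjk_language("繁體"), "\n";
echo guess_cjk_language("简体"), "\n";
```

Text made up only of characters shared by all the languages lands in the ambiguous branch, which is where the encoding from the first heuristic becomes the better hint.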

Yeah, then you'd have to fall back to guessing as per (2) and (3). It's not very nice, but then pretty much everything to do with e-mail handling is unreliable and overcomplicated. — bobince, Nov 06 '11 at 11:59
yeah. I'm not really sure what the OP wants in the first place - they may be looking for detecting the encoding rather than *characters*. — Pekka, Nov 06 '11 at 12:01
