0

I have an SMTP email that is mostly in Japanese, with some parts in English. The Subject of the email is encoded in UTF-8 and base64.

Subject: =?UTF-8?B?5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8?= =?UTF-8?B?44OIIDog5b6M5bel56iL44Oh44O844Kr44O844GM5by344GE?=

How do I detect whether this is Japanese/Chinese and decode it to Japanese/Chinese?

Can I achieve this in Perl/Java/Python?

user1737619

3 Answers

5

You have two steps here. First decode the header:

If you have an email, use a high-level email parser such as Courriel. The subject accessor will return the decoded subject.

If you just have the string, use Encode::MIME::Header:

use Encode qw(decode);
binmode STDOUT, ':encoding(UTF-8)';  # emit the decoded characters as UTF-8
print decode('MIME-Header', 'Subject: =?UTF-8?B?5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8?= =?UTF-8?B?44OIIDog5b6M5bel56iL44Oh44O844Kr44O844GM5by344GE?='), "\n";
__END__
Subject: 半導体製造装置プレビューノート : 後工程メーカーが強い
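Since the question also mentions Python, the same decoding step can be sketched with the standard library's email.header module. This is only an illustration and not part of the original answer: the variable names are made up, and only the header value (without the "Subject: " prefix) is passed in.

```python
from email.header import decode_header, make_header

# Header value copied from the question, without the "Subject: " prefix.
raw_subject = (
    "=?UTF-8?B?5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8?= "
    "=?UTF-8?B?44OIIDog5b6M5bel56iL44Oh44O844Kr44O844GM5by344GE?="
)

# decode_header() splits the value into (bytes, charset) pairs;
# make_header() reassembles them into a Header that str() renders as text.
decoded_subject = str(make_header(decode_header(raw_subject)))
print(decoded_subject)
```

If the whole message is available rather than a bare string, a full email parser is the better starting point, as noted above for Perl.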

The second step is to find out the language. As a human, I can already tell that this is Japanese: the kana characters are the clue, since they only occur in Japanese writing. If that's all you need, then a string that matches \p{Hiragana} or \p{Katakana} is likely Japanese (note that \p{Kana} on its own covers only katakana).
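The kana heuristic above can be sketched in Python as well. Python's built-in re module has no \p{...} classes, so this version compares code points against the Hiragana/Katakana blocks (U+3040 to U+30FF) and the main CJK Unified Ideographs block (U+4E00 to U+9FFF). The function name and the returned labels are hypothetical, and the check is a rough guess rather than real language detection.

```python
def guess_script(text):
    """Rough guess based only on Unicode block ranges (hypothetical helper)."""
    # Hiragana and Katakana blocks, which sit next to each other.
    has_kana = any("\u3040" <= ch <= "\u30ff" for ch in text)
    # Main CJK Unified Ideographs block (kanji / hanzi).
    has_han = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    if has_kana:
        return "japanese"
    if has_han:
        # Han characters alone cannot separate Chinese from kana-free Japanese.
        return "chinese-or-japanese"
    return "other"


print(guess_script("半導体製造装置プレビューノート : 後工程メーカーが強い"))  # japanese
print(guess_script("Hello, world"))  # other
```

A text written only in Han characters stays ambiguous under this check, which is where a proper language detection library becomes necessary.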

For a more general solution, use a language detection library such as Lingua::Identify::CLD, Lingua::Ident, Lingua::Lid, Lingua::YALI, or WebService::Google::Language.

daxim
  • For anyone landing here, there is a very decent perl module [Lingua::Guess](https://metacpan.org/pod/Lingua::Guess), which you could use in command line, like: `echo "このモジュールを上手く使えるかな。" | perl -MLingua::Guess -MEncode=decode_utf8 -nlE 'say "Language is ", Lingua::Guess->new->guess( decode_utf8($_) )->[0]->{name}'` which would produce `Language is japanese` – Jacques May 17 '23 at 22:32
1

You may want to look at these:

chardet: character set detection developed by Mozilla and used in Firefox (source code available).

jchardet: a Java port of Mozilla's automatic charset detection algorithm.

Juned Ahsan
  • +1 The only answer that actually answers the question. – Thihara Jun 26 '13 at 07:05
  • The Perl ports of that library are mentioned in http://stackoverflow.com/questions/1970660/how-can-i-guess-the-encoding-of-a-string-in-perl – daxim Jun 26 '13 at 13:54
  • -1, but it's the wrong tool, the question asked for language detection, not encoding detection. – daxim Jun 26 '13 at 14:01
-1

With Java, you need a library to decode your Base64 string to binary, for example Apache Commons Codec.

Then it's straightforward:

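  // Base64 here is org.apache.commons.codec.binary.Base64 (Apache Commons Codec)
  byte[] b = Base64.decodeBase64("5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8");
  // Interpret the decoded bytes as UTF-8 text
  String s = new String(b, "UTF-8");
  System.out.println(s);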

It prints: 半導体製造装置プレビューノー (I don't know what it means, but it sure looks like Japanese).
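The snippet above decodes only the first of the two encoded-words in the subject, which is why the printed text stops at ノー. Below is a sketch of the same manual approach applied to the whole header, written in Python since the question allows it. It assumes every encoded-word is UTF-8 with base64 ("B") encoding, as in this subject; the variable names are illustrative.

```python
import base64
import re

subject = (
    "=?UTF-8?B?5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8?= "
    "=?UTF-8?B?44OIIDog5b6M5bel56iL44Oh44O844Kr44O844GM5by344GE?="
)

# Pull out the base64 payload of every UTF-8 "B" encoded-word.
payloads = re.findall(r"=\?UTF-8\?B\?([^?]*)\?=", subject, flags=re.IGNORECASE)

# Join the raw bytes before decoding, in case a sender split a
# multi-byte character across two encoded-words.
raw_bytes = b"".join(base64.b64decode(p) for p in payloads)
text = raw_bytes.decode("utf-8")
print(text)
```

Headers that mix charsets or use quoted-printable ("Q") encoding fall outside this assumption and are better left to a MIME-aware parser.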

obourgain
  • How can I identify which language it is in the code from the output text? – user1737619 Jun 26 '13 at 06:57
  • If you want to recognize whether the text is Japanese or English, I suggest looking at https://code.google.com/p/language-detection/. – obourgain Jun 26 '13 at 07:11
  • This code actually prints out ??????????????. Should I change the language settings in Eclipse for it to print Japanese? – user1737619 Jun 26 '13 at 07:13
  • I don't see how it can print ?????????. Be sure to parse only the part of the String between =?UTF-8?B? and ?= (there are two parts in the subject). – obourgain Jun 26 '13 at 07:25
  • I changed the text file encoding to UTF-8 in the project Properties in Eclipse and it worked. – user1737619 Jun 26 '13 at 08:28