read from stdin any kind of encoding, and I don't know what it will be

Question

I am processing emails fed in from stdin, and they arrive in 50 different encodings. How do I write code to convert them to UTF-8, or at least read them into memory without hitting this error:

SyntaxError: Non-ASCII character '\xc2' in file mime2vt.py on line 450, but no encoding declared;

I cannot declare an encoding, because some random person on the internet has encoded each message with whatever they chose.

Read it into memory. I get an error at this step
Detect the type
Decode it
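
The three steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes Python 3, a single-part message, and that the declared charset is usable. The helper names `read_message` and `body_as_text` and the `latin-1` fallback are hypothetical choices made for illustration, and the sample message is made up.

```python
import email
import io


def read_message(binary_stream):
    """Step 1: read raw bytes. Nothing is decoded here, so no UnicodeDecodeError."""
    raw = binary_stream.read()
    return email.message_from_bytes(raw)


def body_as_text(msg, fallback="latin-1"):
    """Steps 2 and 3: take the declared charset, then decode the body with it.

    Assumes a single-part message; for multipart messages
    get_payload(decode=True) returns None and each part needs its own pass.
    """
    charset = msg.get_content_charset() or fallback
    payload = msg.get_payload(decode=True)  # bytes, transfer encoding undone
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        # The header named a charset Python does not know (or a bogus one).
        return payload.decode(fallback, errors="replace")


# Made-up sample: a GB2312 body containing the two characters U+4F60 U+597D.
sample = (
    b'Content-Type: text/plain;charset="GB2312"\n'
    b"\n"
    + "\u4f60\u597d".encode("gb2312")
    + b"\n"
)
msg = read_message(io.BytesIO(sample))
print(msg.get_content_charset())
print(body_as_text(msg))
```

In a real script the stream would presumably be `sys.stdin.buffer`, the binary layer underneath `sys.stdin`, instead of the `io.BytesIO` stand-in used here.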

I have altered the addresses to protect myself.

From: =?GB2312?B?x+vXqtDox/PIy9Sx?= <vba@uyh.com>
Subject: =?GB2312?B?tNO8vMr119/P8rncwO0=?=
To: me@me.com
Content-Type: text/plain;charset="GB2312"
Date: Sat, 26 Sep 2015 07:49:08 +0800

´Ó¼¼Êõ×ßÏò¹ÜÀí

2015ÄêÊ±¼ä°²ÅÅ
10ÔÂ26-27ÈÕ±±¾©   11ÔÂ2-3ÈÕÉÏº£     10ÔÂ29-30ÈÕÉîÛÚ
11ÔÂ19-20±±¾©     11ÔÂ23-24ÈÕÉÏº£   11ÔÂ30-12ÔÂ1ÈÕÉîÛÚ
12ÔÂ28-29ÈÕ±±¾©   12ÔÂ24-25ÈÕÉÏº£   12ÔÂ21-22ÈÕÉîÛÚ

This is the code that errors.

import email
import sys

data = ""
for line in sys.stdin:
    data += line.decode("utf8")
msg = email.message_from_string(data)

I don't even get far enough to have access to the charset= value. That could be faked too, but at least I would have some clue.

I just tried this: data=sys.stdin.read(100) and got this:

  File "/usr/lib64/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 546: invalid start byte

I didn't even ask for byte 546, yet it still reads that far, even if I ask for only 1 byte.
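
This behaviour is consistent with how a Python 3 text stream works: it pulls a whole chunk of bytes from the underlying binary stream and decodes that chunk before handing back the requested characters. A small sketch that reproduces it without stdin, using `io.BytesIO` as a stand-in and a made-up sample; the helper names are hypothetical:

```python
import io


def first_char_text(raw):
    """Read one character through a UTF-8 text layer, similar to reading sys.stdin."""
    text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
    return text_stream.read(1)


def first_byte_binary(raw):
    """Read one byte from the binary layer; no decoding takes place."""
    return io.BytesIO(raw).read(1)


# Made-up sample: ASCII header text followed by non-UTF-8 bytes.
sample = b"Subject: test\n\n\xb4\xd3\xbc\xbc"

try:
    first_char_text(sample)
except UnicodeDecodeError as exc:
    # The bad byte sits well past the first character, but it is in the same chunk.
    print("text layer failed:", exc.reason)

print(first_byte_binary(sample))
```

Reading from the binary layer avoids the problem because no decoder is involved at all.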

Someone suggested `python3 -u </here/whatever`, but that still errors.

Only one answer: you guess! Even the most clever heuristic can occasionally be fooled by certain documents in certain encodings. I'd look for a pre-built solution online if I were you. — jpaugh, Jul 23 '16 at 02:30
Well, here's a [Perl module](http://perldoc.perl.org/Encode/Guess.html). You could try translating that into Python... However you do it, it'll still be a fair amount of work. — jpaugh, Jul 23 '16 at 02:36
Let's say I just want to get it into memory first. Then we will look for charset=, try that and hope it is the truth, and deal with the possibility of a lie separately. — cybernard, Jul 23 '16 at 02:40
In that case, you'll want to read it in as binary data, not text; then, you can try to decode it to text later, and multiple times if you wish. (Text has a character set; binary data does not.) — jpaugh, Jul 23 '16 at 02:43
How? I may be an excellent C and C++ programmer, and good at Perl and PHP, but Python is a whole different story. **Show me the code.** — cybernard, Jul 23 '16 at 02:44
Possible duplicate of [Reading binary file in Python and looping over each byte](http://stackoverflow.com/questions/1035340/reading-binary-file-in-python-and-looping-over-each-byte) — tripleee, Jul 23 '16 at 06:59
Also http://stackoverflow.com/questions/2850893/reading-binary-data-from-stdin — tripleee, Jul 23 '16 at 07:00
@triplee All these commands use the **open** command to open a file. I need to read from stdin so this is not a duplicate. — cybernard, Jul 23 '16 at 15:46
"If you need to interpret binary data, use the struct module." This might be helpful if I knew how to do that. — cybernard, Jul 23 '16 at 16:21
The [second link](http://stackoverflow.com/questions/38537516/read-from-stdin-any-kind-of-encoding-and-i-dont-know-what-it-will-be?noredirect=1#comment64471050_38537516) is specifically about binary data on `sys.stdin`. — tripleee, Jul 23 '16 at 16:23
@tripleee None of those worked. The msvcrt thing is Windows only, and I am doing this on Linux. — cybernard, Jul 24 '16 at 17:55
There are three other answers which do not seem to have portability issues. If you tried all three and they all failed, I'll certainly remove my close vote if you update your question with detailed failure diagnostics for each of those techniques. — tripleee, Jul 24 '16 at 18:01
@triplee #1 does nothing; it fails in exactly the same way. #2 is an example of how to write, and I am trying to read. #3, even with the (10), reads the whole line and fails with "can't decode byte whatever". — cybernard, Jul 24 '16 at 18:43
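
The suggestion in the comments, to keep the data as bytes and try decoding it more than once, could look roughly like this. It is a sketch only: the function name `decode_with_fallbacks` and the candidate list are assumptions made for illustration, and a clean decode does not prove the guess was right, as the first comment points out.

```python
def decode_with_fallbacks(raw, candidates=("utf-8", "gb2312", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly.

    latin-1 accepts every byte sequence, so with it in the list the loop
    always returns; put it last, as a last resort.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    # No candidate fit: keep the data anyway, marking the undecodable bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# Bytes 0xb4 0xd3 0xbc 0xbc: assumed here to be how the sample body starts,
# given that the traceback complains about byte 0xb4.
text, encoding = decode_with_fallbacks(b"\xb4\xd3\xbc\xbc")
print(encoding)
```

A charset taken from the message's Content-Type header would normally go first in the candidate list, with the guesses after it.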
