I am processing an email feed read from stdin, and the emails arrive in some 50 different encodings. How do I write code to convert them to UTF-8, or at least read them into memory without an error?

SyntaxError: Non-ASCII character '\xc2' in file mime2vt.py on line 450, but no encoding declared;

I cannot declare an encoding, because any random person on the internet may have encoded the message with whatever they like.

  1. Read it into memory. I get an error at this step.
  2. Detect the encoding.
  3. Decode it.
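The three steps above can be sketched as follows. This is a minimal sketch, assuming Python 3: `read_message` and `body_text` are hypothetical helper names, and falling back to latin-1 when no usable charset is declared is an assumption, not something the question prescribes. The key point is that the input is read as bytes, so nothing is decoded until the declared charset is known.

```python
import email
import io


def read_message(stream):
    """Step 1: read raw bytes (no decoding yet) and parse them as an email."""
    raw = stream.read()
    return email.message_from_bytes(raw)


def body_text(msg, default="latin-1"):
    """Steps 2 and 3: look up the declared charset, then decode the body."""
    # Assumption: fall back to latin-1 when no charset is declared.
    charset = msg.get_content_charset() or default
    # For a multipart message get_payload(decode=True) returns None.
    payload = msg.get_payload(decode=True) or b""
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:  # the declared charset name is unknown to Python
        return payload.decode(default, errors="replace")


# A tiny stand-in for stdin, using bytes from the sample message body.
sample = (
    b'Content-Type: text/plain;charset="GB2312"\n'
    b"\n"
    b"\xb4\xd3\xbc\xbc\xca\xf5\xd7\xdf\xcf\xf2\xb9\xdc\xc0\xed\n"
)
msg = read_message(io.BytesIO(sample))
text = body_text(msg)
```

In a real script the stream would be `sys.stdin.buffer`, which is the binary stream underneath `sys.stdin` in Python 3, for example `read_message(sys.stdin.buffer)`.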

Here is a sample message. I have altered the addresses to protect myself.

Return-Path: <vba@uyh.com>
X-Original-To: me@me.com
Delivered-To: me@me.com
Received: from uyh.com (unknown [27.150.160.116])
        by <me> (Postfix) with ESMTP id 528FC7B5A53
        for <me@me.com>; Fri, 25 Sep 2015 18:49:13 -0500 (CDT)
From: =?GB2312?B?x+vXqtDox/PIy9Sx?= <vba@uyh.com>
Subject: =?GB2312?B?tNO8vMr119/P8rncwO0=?=
To: me@me.com
Content-Type: text/plain;charset="GB2312"
Date: Sat, 26 Sep 2015 07:49:08 +0800
X-Priority: 2
X-Mailer: Microsoft Outlook Express 5.50.4133.2400

´Ó¼¼Êõ×ßÏò¹ÜÀí

2015Äêʱ¼ä°²ÅÅ
10ÔÂ26-27ÈÕ±±¾©   11ÔÂ2-3ÈÕÉϺ£     10ÔÂ29-30ÈÕÉîÛÚ
11ÔÂ19-20±±¾©     11ÔÂ23-24ÈÕÉϺ£   11ÔÂ30-12ÔÂ1ÈÕÉîÛÚ
12ÔÂ28-29ÈÕ±±¾©   12ÔÂ24-25ÈÕÉϺ£   12ÔÂ21-22ÈÕÉîÛÚ
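The `From` and `Subject` headers in the sample above use RFC 2047 encoded words, which carry their own charset name (GB2312 here), so they can be decoded independently of the body. A minimal sketch using the standard library's `email.header` module; `decode_mime_header` is a hypothetical helper name:

```python
from email.header import decode_header, make_header


def decode_mime_header(value):
    """Decode RFC 2047 encoded words in a header value into a str."""
    return str(make_header(decode_header(value)))


# The Subject header from the sample message.
subject = decode_mime_header("=?GB2312?B?tNO8vMr119/P8rncwO0=?=")
```

The base64 part of this header decodes to the same bytes as the first line of the message body, which is a hint that the declared charset is at least consistent within this message.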

This is the code that errors.

import email
import sys

data = ""
for line in sys.stdin:  # fails here: sys.stdin decodes the input as text
    data += line.decode("utf8")
msg = email.message_from_string(data)

I don't even get far enough to have access to the charset= value. That value could be faked too, but at least it would give me some clue.

I just tried this: data=sys.stdin.read(100) and got this:

  File "/usr/lib64/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 546: invalid start byte

I never asked for byte 546, yet it is still read and decoded, even when I ask for just 1 byte.
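A likely explanation for this behaviour: in Python 3, `sys.stdin` is a text stream that pulls a whole chunk of bytes from the underlying buffer and decodes the entire chunk before returning even one character. The sketch below reproduces that with `io.TextIOWrapper` over an in-memory buffer. The 546-byte prefix is made up to mirror the traceback, and the variable names are illustrative only.

```python
import io

# 546 ASCII bytes followed by two bytes that are not valid UTF-8.
raw = b"A" * 546 + b"\xb4\xd3" + b"\n"

# Text mode: the wrapper decodes a whole buffered chunk, so read(1) fails.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
try:
    text_stream.read(1)
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True

# Binary mode: no decoding happens, so any byte value is accepted.
binary_stream = io.BytesIO(raw)
first_byte = binary_stream.read(1)
everything_else = binary_stream.read()
```

Reading from `sys.stdin.buffer` instead of `sys.stdin` gives the binary behaviour shown in the second half.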

Someone suggested `python3 -u </here/whatever`, but that still fails.

cybernard
  • Only one answer: you guess! Even the most clever heuristic can occasionally be fooled by certain documents in certain encodings. I'd look for a pre-built solution online if I were you. – jpaugh Jul 23 '16 at 02:30
  • For example? And how do I integrate it into my code? – cybernard Jul 23 '16 at 02:32
  • Well, here's a [Perl module](http://perldoc.perl.org/Encode/Guess.html). You could try translating that into Python... However you do it, it'll still be a fair amount of work. – jpaugh Jul 23 '16 at 02:36
  • Let's say I just want to get it into memory first. Then we can look for charset=, try that value in the hope that it is true, and deal with the possibility that it is a lie separately. – cybernard Jul 23 '16 at 02:40
  • In that case, you'll want to read it in as binary data, not text; then, you can try to decode it to text later, and multiple times if you wish. (Text has a character set; binary data does not.) – jpaugh Jul 23 '16 at 02:43
  • How? I may be an excellent C and C++ programmer, and good in Perl and PHP, but Python is a whole different story. **Show me the code.** – cybernard Jul 23 '16 at 02:44
  • Possible duplicate of [Reading binary file in Python and looping over each byte](http://stackoverflow.com/questions/1035340/reading-binary-file-in-python-and-looping-over-each-byte) – tripleee Jul 23 '16 at 06:59
  • Also http://stackoverflow.com/questions/2850893/reading-binary-data-from-stdin – tripleee Jul 23 '16 at 07:00
  • @tripleee All these answers use the **open** command to open a file. I need to read from stdin, so this is not a duplicate. – cybernard Jul 23 '16 at 15:46
  • "If you need to interpret binary data, use the struct module." This might be helpful if I knew how to do that. – cybernard Jul 23 '16 at 16:21
  • The [second link](http://stackoverflow.com/questions/38537516/read-from-stdin-any-kind-of-encoding-and-i-dont-know-what-it-will-be?noredirect=1#comment64471050_38537516) is specifically about binary data on `sys.stdin`. – tripleee Jul 23 '16 at 16:23
  • @tripleee none of those worked. The msvcrt thing is windows only and I am doing this on linux. – cybernard Jul 24 '16 at 17:55
  • There are three other answers which do not seem to have portability issues. If you tried all three and they all failed, I'll certainly remove my close vote if you update your question with detailed failure diagnostics for each of those techniques. – tripleee Jul 24 '16 at 18:01
  • @tripleee #1 does nothing and fails in exactly the same way. #2 is an example of how to write, and I am trying to read. #3, even with the (10), reads the whole line and fails with "can't decode byte" whatever. – cybernard Jul 24 '16 at 18:43
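The suggestion in the comments (keep the data as bytes, then try decoding it, several times if necessary) can be sketched like this. It is only a heuristic under stated assumptions: `decode_with_fallbacks` is a hypothetical helper, the fallback list is an arbitrary choice, and latin-1 is used as a last resort only because it never fails, not because it is likely to be correct.

```python
def decode_with_fallbacks(raw, declared=None, fallbacks=("utf-8", "gb18030")):
    """Return (text, charset_used) for the first charset that decodes raw."""
    candidates = [declared] if declared else []
    candidates.extend(fallbacks)
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown charset name: try the next one
    # latin-1 maps every byte to a code point, so this decode cannot raise.
    return raw.decode("latin-1"), "latin-1"


# Bytes from the sample body, paired with a deliberately bogus charset label.
raw_body = b"\xb4\xd3\xbc\xbc\xca\xf5"
text, used = decode_with_fallbacks(raw_body, declared="no-such-charset")
```

A successful decode does not prove the guess was right, since many byte sequences are valid in more than one encoding, which is the point jpaugh makes in the first comment.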
