I am processing an email feed from stdin, and the emails arrive in some 50 different encodings. How do I write code to convert them to UTF-8, or at least read them into memory without an error such as this one?
SyntaxError: Non-ASCII character '\xc2' in file mime2vt.py on line 450, but no encoding declared;
I cannot declare an encoding, because any random person on the internet may have encoded the message with whatever they like.
The steps I need are:
- Read it into memory (I get an error at this step).
- Detect the type.
- Decode it.
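The three steps above could be sketched roughly as follows. This is only a minimal sketch, assuming Python 3 and a single non-multipart message on stdin; `read_message` and `body_text` are hypothetical helper names, not part of any library. The key idea is to read from `sys.stdin.buffer`, which yields raw bytes, and to let `email.message_from_bytes` parse the headers before anything is decoded.

```python
import email
import sys


def read_message(stream=None):
    """Read one raw message as bytes and parse it without decoding first."""
    # sys.stdin.buffer is the binary layer underneath sys.stdin. Reading
    # from it runs no codec, so it cannot raise UnicodeDecodeError.
    if stream is None:
        stream = sys.stdin.buffer
    raw = stream.read()
    return email.message_from_bytes(raw)


def body_text(msg):
    """Decode a non-multipart body using the charset the message declares."""
    # Returns the charset= parameter in lower case, or None if absent.
    charset = msg.get_content_charset() or "ascii"
    # decode=True undoes base64/quoted-printable and returns bytes.
    payload = msg.get_payload(decode=True)
    # errors="replace" keeps going if a few bytes do not fit the charset.
    return payload.decode(charset, errors="replace")
```

With this, `msg = read_message()` followed by `body_text(msg)` would give a `str` for the sample message, provided the declared `GB2312` charset is accurate. Multipart messages would need a loop over `msg.walk()`, which is left out here.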
I have altered the addresses in this sample to protect myself:
Return-Path: <vba@uyh.com>
X-Original-To: me@me.com
Delivered-To: me@me.com
Received: from uyh.com (unknown [27.150.160.116])
by <me> (Postfix) with ESMTP id 528FC7B5A53
for <me@me.com>; Fri, 25 Sep 2015 18:49:13 -0500 (CDT)
From: =?GB2312?B?x+vXqtDox/PIy9Sx?= <vba@uyh.com>
Subject: =?GB2312?B?tNO8vMr119/P8rncwO0=?=
To: me@me.com
Content-Type: text/plain;charset="GB2312"
Date: Sat, 26 Sep 2015 07:49:08 +0800
X-Priority: 2
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
´Ó¼¼Êõ×ßÏò¹ÜÀí
2015Äêʱ¼ä°²ÅÅ
10ÔÂ26-27ÈÕ±±¾© 11ÔÂ2-3ÈÕÉϺ£ 10ÔÂ29-30ÈÕÉîÛÚ
11ÔÂ19-20±±¾© 11ÔÂ23-24ÈÕÉϺ£ 11ÔÂ30-12ÔÂ1ÈÕÉîÛÚ
12ÔÂ28-29ÈÕ±±¾© 12ÔÂ24-25ÈÕÉϺ£ 12ÔÂ21-22ÈÕÉîÛÚ
This is the code that raises the error:
data = join([sys.stdin])
for line in sys.stdin:
    data+=line.decode("utf8")
msg = email.message_from_string(data)
I don't even get far enough to have access to the charset= value. That value could also be faked, but at least it would give me some clue.
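Since the declared charset may be wrong or unknown to Python, one option is to treat it as a first guess only. The sketch below is an illustration under that assumption: `decode_with_fallback` and its list of fallback encodings are hypothetical choices, not a standard recipe, and the order of the guesses would need tuning for a real mail feed.

```python
def decode_with_fallback(raw, declared=None,
                         fallbacks=("utf-8", "gb18030", "latin-1")):
    """Try the declared charset first, then a fixed list of guesses.

    Returns a (text, encoding_used) tuple.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # LookupError: Python has no codec registered under that name.
            # UnicodeDecodeError: the bytes are not valid in that codec.
            continue
    # latin-1 accepts every byte, so this is reached only if the caller
    # passed a fallback list without such a catch-all.
    return raw.decode("ascii", errors="replace"), "ascii"
```

A successful decode does not prove the guess was right (latin-1 never fails, for example), so the returned encoding name is worth logging alongside the text.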
I just tried this: data=sys.stdin.read(100) and got this:
File "/usr/lib64/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 546: invalid start byte
I didn't even ask for byte 546, yet it is still read and decoded, even if I ask for just 1 byte.
Someone suggested running python3 -u </here/whatever, but that still errors.
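The failure at byte 546 on a short read can be reproduced without stdin. As far as I can tell, a text stream such as `sys.stdin` pulls a whole chunk of bytes from the underlying buffer and decodes the entire chunk before handing back the requested characters. The sketch below imitates that with `io.TextIOWrapper` over made-up data (546 ASCII bytes followed by two GB2312 bytes); the data is an assumption chosen to mirror the offset in the traceback.

```python
import io

# 546 ASCII bytes followed by 0xb4 0xd3, mirroring the traceback offset.
raw = b"a" * 546 + b"\xb4\xd3"

# A text layer over the bytes, configured like a UTF-8 stdin.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
try:
    text_stream.read(1)  # asks for a single character
    failed_at = None
except UnicodeDecodeError as exc:
    # exc.start is the offset of the offending byte in the decoded chunk.
    failed_at = exc.start

# The binary layer returns exactly the bytes requested, undecoded.
first_byte = io.BytesIO(raw).read(1)
```

Here `failed_at` ends up as 546 even though only one character was requested, while the binary read succeeds. That suggests the fix belongs at the byte level (`sys.stdin.buffer`) rather than in buffering flags.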