Is there a way to get around Unicode issues when using win32api/com modules in Python 3?

Question

I've looked around and haven't found anything yet. I'm going through the emails in an inbox and checking each one for a specific set of words. It works on most emails, but some of them don't parse. I checked the broken emails using:

print (msg.Body.encode('utf8'))

and the output for my problem messages all starts with b', like this:

b'\xe6\xa0\xbc\xe6\xb5\xb4\xe3\xb9\xac\xe6\xa0\xbc\xe6\x85\xa5\xe3\xb9\xa4\xe0\xa8\x8d\xe6\xb4\xbc\xe7\x91\xa5\xe2\x81\xa1\xe7\x91\x

I think this is forcing Python to read the body as bytes, but I'm not sure. Either way, no matter what encoding I try, I get nothing but garbage text after the b.

I've tried other encodings, as well as decoding first, but I just get a ton of attribute errors.

import win32api
import win32com.client
import datetime
import os
import time


outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
dater = datetime.date.today() - datetime.timedelta(days = 1)
dater = str(dater.strftime("%m-%d-%Y")) 
print (dater)
#for folders in outlook.folders:
#    print(folders)

Receipt = outlook.folders[8]

print(Receipt)

Ritems = Receipt.folders["Inbox"]

Rmessage = Ritems.items

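for msg in Rmessage:
    # Class 46 is olReport in Outlook's OlObjectClass enumeration (a ReportItem, e.g. a delivery report)
    if (msg.Class == 46 and msg.CreationTime.strftime("%m-%d-%Y") == dater):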
        print (msg.CreationTime)
        print (msg.Subject)
        print (msg.Body.encode('utf8'))

        print ('..............................')

The end result I want is to have the message printed to the console, or at least to give Python a way to read it so I can find the text I'm looking for in the body.

[This](https://learn.microsoft.com/en-us/windows/desktop/learnwin32/working-with-strings) says Windows uses UTF-16 encodings. — martineau, Jan 22 '19 at 01:30
This was the answer. This at least gave 95% of the messages in html formatting which I can work with. I'll have to figure out what encoding the other 5% randomly use. — evobe, Jan 22 '19 at 16:08
evobe: In that case I suggest you update your question accordingly (or post an answer to your own question). — martineau, Jan 22 '19 at 16:30

score 0 · Answer 1 · answered Jan 22 '19 at 01:42

The byte literal posted in the question is valid UTF-8. The first two characters are U+683C and U+6D74, both from the CJK Unified Ideographs block (U+4E00 to U+9FFF).
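The code points named above can be checked directly. This is a minimal sketch that uses only the first three complete characters of the literal from the question; the variable names are illustrative.

```python
# First nine bytes of the literal from the question (three complete UTF-8 sequences).
raw = b'\xe6\xa0\xbc\xe6\xb5\xb4\xe3\xb9\xac'

# Decoding as UTF-8 succeeds, which is what "valid UTF-8" means here.
text = raw.decode('utf-8')

# Show each decoded character as a U+XXXX code point.
code_points = [f'U+{ord(ch):04X}' for ch in text]
print(code_points)
```

Note that a successful decode only shows the bytes are well-formed UTF-8. It does not show that the resulting characters are the text the sender wrote.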

Since you don't know the source encoding, there is no way to be completely sure, but chances are that the email body is just Han characters encoded in UTF-8 (see "Determine the encoding of text in Python"). If you cannot see the UTF-8 characters correctly, check the character set of your terminal or display.

That said, you should get the fundamentals of character representation right. Randomly encoding or decoding is unlikely to solve anything. I would suggest you begin by reading Spolsky's introduction to Unicode and then move on to Batchelder on Unicode in Python.

The read was interesting. I did take a look at the source and since the charset said ascii I tried that along with utf-8 which I was familiar with. Either way thanks for providing the links. — evobe, Jan 22 '19 at 16:06

score 0 · Accepted Answer · answered Jan 22 '19 at 16:41

As martineau said, the encoding I was looking for was UTF-16. The other messages were encoded using UTF-8. So a simple mail scrape turned out to be an excellent lesson in encodings as well as message Classes (off topic). Thanks for the help.
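For anyone landing here later, this is a rough sketch of what that looks like in code. It assumes the garbled bodies are single-byte text whose bytes were read in pairs as UTF-16-LE, which is consistent with the sample in the question. `repair_body` is a hypothetical helper name, and the NUL check is only a heuristic, so genuine CJK text could be misdetected.

```python
def repair_body(text):
    """Heuristically undo bytes that were misread as UTF-16-LE (hypothetical helper)."""
    try:
        # Turn each character back into its two original bytes, then decode those.
        candidate = text.encode('utf-16-le').decode('utf-8')
    except UnicodeError:
        # The byte pairs are not valid UTF-8, so the text was probably fine already.
        return text
    # Ordinary text round-tripped this way is full of NULs; keep the original then.
    return text if '\x00' in candidate else candidate


# Complete leading bytes of the sample from the question, as msg.Body would hold them.
garbled = (b'\xe6\xa0\xbc\xe6\xb5\xb4\xe3\xb9\xac\xe6\xa0\xbc\xe6\x85\xa5'
           b'\xe3\xb9\xa4\xe0\xa8\x8d\xe6\xb4\xbc\xe7\x91\xa5\xe2\x81\xa1').decode('utf-8')

print(repr(repair_body(garbled)))   # start of an HTML document
print(repr(repair_body('hello')))   # left unchanged
```

In the loop from the question, this would be called as `repair_body(msg.Body)` before searching the text.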

evobe
