I STILL DO NOT HAVE A SOLUTION.
I am writing a mail processing program. I have a big INBOX file, downloaded by Thunderbird, that contains my Gmail.
The program runs on some mailboxes, but not on the INBOX from Gmail. It runs for a long while, but then I get a UnicodeDecodeError exception with the message:
'ascii' codec can't decode byte 0xe2 in position 56: ordinal not in range(128)
Fixing the decoding logic is one possibility, but after trying various encoding names, I either still get an exception or I miss processing most messages.
I accept the possibility that some messages may be invalid or cannot be decoded. Skipping them is ok, but I cannot figure out how to do so, since the exception occurs while running code underlying the for-loop implementation, such that I cannot use a try/except to catch and skip bad messages.
The traceback includes only one line from my program, which is this line:
for message in mbox:
That appears to call itervalues, which calls __getitem__, which calls get_message in mailbox.py. I don't know the mechanics of the for loop in Python, but itervalues would seem to be the way the for loop iterates over all messages in the mbox, and it does that by calling a generic __getitem__, which calls a mailbox.get_message().
If there is something wrong with a single message, that's fine, but I would like to skip it and move on. The problem is that since I am not making any API calls, I don't know where I would put a try / except handler. I guess I could wrap the entire for loop with a handler, but that would not allow me to continue to the next record.
I can reproduce the problem with just a few lines:
import mailbox

mbox = mailbox.mbox('INBOX')
print(str(len(mbox)) + ' messages in mbox')
processed = 0
for message in mbox:
    processed += 1
    if processed % 10000 == 0:
        print('processed ', processed, ' so far')
The exception happens somewhere after 30k messages in a file that has almost 200k.
Can someone suggest how to trap the exception, allowing me to skip the broken one and continue?
UPDATE: Here's the traceback resulting from the exception:
Traceback (most recent call last):
  File "C:\Users\Mark Colan\.p2\pool\plugins\org.python.pydev_4.5.5.201603221110\pysrc\pydevd.py", line 1529, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Users\Mark Colan\.p2\pool\plugins\org.python.pydev_4.5.5.201603221110\pysrc\pydevd.py", line 936, in run
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "C:\Users\Mark Colan\.p2\pool\plugins\org.python.pydev_4.5.5.201603221110\pysrc\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "D:\Dev\Sandbox2\Sandbox2.py", line 6, in <module>
    for message in mbox:
  File "C:\Program Files\Python35\lib\mailbox.py", line 108, in itervalues
    value = self[key]
  File "C:\Program Files\Python35\lib\mailbox.py", line 72, in __getitem__
    return self.get_message(key)
  File "C:\Program Files\Python35\lib\mailbox.py", line 779, in get_message
    msg.set_from(from_line[5:].decode('ascii'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 56: ordinal not in range(128)
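
Going by the traceback, the failure happens in the self[key] lookup that itervalues performs for each message. One possible direction, offered only as a sketch, is to iterate over the keys and do that lookup explicitly, so that a try/except can wrap each individual message. The helper name iter_decodable and its skipped parameter are hypothetical, not part of the mailbox module, and the sketch assumes that listing the keys does not itself trigger the failing decode.

```python
import mailbox


def iter_decodable(mbox, skipped=None):
    """Yield (key, message) for each message that loads without a decode error.

    Hypothetical helper, not part of the mailbox module. Keys of messages
    that raise UnicodeDecodeError are appended to `skipped` if it is given.
    """
    for key in mbox.iterkeys():
        try:
            # The same lookup that itervalues() performs in the traceback.
            message = mbox[key]
        except UnicodeDecodeError:
            if skipped is not None:
                skipped.append(key)
            continue  # move on to the next message
        yield key, message


def count_messages(path):
    """Open the mbox file at `path` and return (processed, skipped_keys)."""
    mbox = mailbox.mbox(path)
    skipped = []
    processed = 0
    for key, message in iter_decodable(mbox, skipped):
        processed += 1
        if processed % 10000 == 0:
            print('processed', processed, 'so far')
    return processed, skipped
```

Calling count_messages('INBOX') would then report how many messages loaded and which keys were skipped. Only UnicodeDecodeError is caught here, so any other kind of failure would still stop the loop.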