
I STILL DO NOT HAVE A SOLUTION.

I am writing a mail processing program. I have a big INBOX file downloaded (by Thunderbird) that has my gmail.

My program runs on some mailboxes, but not on the INBOX from gmail. It runs for a long while, but then I get a UnicodeDecodeError exception with the message: 'ascii' codec can't decode byte 0xe2 in position 56: ordinal not in range(128).

Fixing the decoding logic is one possibility, but after trying various decoding strings, I either still get an exception, or I miss processing most messages.

I accept the possibility that some messages may be invalid or cannot be decoded. Skipping them is ok, but I cannot figure out how to do so, since the exception occurs while running code underlying the for-loop implementation, such that I cannot use a try/except to catch and skip bad messages.

The traceback includes only one line from my program, which is this line:

for message in mbox:

That appears to call itervalues, which calls __getitem__, which calls get_message in mailbox.py. I don't know the mechanics of the for loop in Python, but itervalues would seem to be the way the for loop iterates over all messages in the mbox, and it does that by calling a generic __getitem__, which calls a mailbox.get_message().

If there is something wrong with a single message, that's fine, but I would like to skip it and move on. The problem is that since I am not making any API calls, I don't know where I would put a try / except handler. I guess I could wrap the entire for loop with a handler, but that would not allow me to continue to the next record.

I can reproduce the problem with just a few lines:

import mailbox
mbox = mailbox.mbox('INBOX')
print(str(len(mbox)) + ' messages in mbox')
processed=0
for message in mbox:
    processed += 1
    if processed % 10000 == 0:
        print('processed ', processed, ' so far')

The exception happens somewhere after 30k messages in a file that has almost 200k.

Can someone suggest how to trap the exception, allowing me to skip the broken one and continue?

UPDATE: Here's the traceback resulting from the exception:

Traceback (most recent call last):

  File "C:\Users\Mark Colan\.p2\pool\plugins\org.python.pydev_4.5.5.201603221110\pysrc\pydevd.py", line 1529, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Users\Mark Colan\.p2\pool\plugins\org.python.pydev_4.5.5.201603221110\pysrc\pydevd.py", line 936, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Users\Mark Colan\.p2\pool\plugins\org.python.pydev_4.5.5.201603221110\pysrc\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "D:\Dev\Sandbox2\Sandbox2.py", line 6, in <module>
    for message in mbox:
  File "C:\Program Files\Python35\lib\mailbox.py", line 108, in itervalues
    value = self[key]
  File "C:\Program Files\Python35\lib\mailbox.py", line 72, in __getitem__
    return self.get_message(key)
  File "C:\Program Files\Python35\lib\mailbox.py", line 779, in get_message
    msg.set_from(from_line[5:].decode('ascii'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 56: ordinal not in range(128)
Mark Colan
  • Please provide the full stack trace; describing it in words isn't nearly as helpful as just providing it. – Tadhg McDonald-Jensen Jun 17 '16 at 20:48
  • Simple suggestion, `try: ... except UnicodeDecodeError as e: print e; pass`;-) around that loop or tighter until ... you have found the "place" or processed the interesting messages. – Dilettant Jun 17 '16 at 20:54
  • @Dilettant `print e` isn't going to give you nearly as much information as the raw traceback message, what is the purpose of doing that? – Tadhg McDonald-Jensen Jun 17 '16 at 20:56
  • I added the traceback. I don't think putting try/except around the for loop is going to help isolate the problem, since it appears that it is not in the code that is iterated, but in the for iteration handling itself. – Mark Colan Jun 17 '16 at 21:00
  • I wanted to suggest `try: ... except UnicodeDecodeError: pass` but felt more investigative - it is a late European evening. Seriously: OP seems to be more in the skipping business than in archeology ;-) ... that print seemed like a well balanced compromise. – Dilettant Jun 17 '16 at 21:00
  • `b"\xe2".decode("ascii")` raises that error in python2 but not in python 3, switching versions may fix it. – Tadhg McDonald-Jensen Jun 17 '16 at 21:05
  • Someone downvoted me. I sure wish that when that happens, that a comment would be added so I could understand what's wrong with my question. – Mark Colan Jun 17 '16 at 21:08
  • Tadhg, I use python 3.5.1 – Mark Colan Jun 17 '16 at 21:09
  • I edited your title to give a better idea about your problem. Though your suggested fix (trap the iterator error) is also an interesting question... – alexis Jun 17 '16 at 21:25
  • I did not indicate the exception because I don't consider it important. There could be any number of exceptions on a given message, and I need to catch them. The fix won't be to the mbox code, presumably, but rather in the for loop, trapping the exception and continuing. The mbox code doesn't have much choice but to throw when it gets bad data, but I need to be able to handle all the data in my mbox. – Mark Colan Jun 17 '16 at 21:45
  • What's important is that it's a problem with reading the mailbox-- don't be so quick to give up on fixing it! – alexis Jun 17 '16 at 21:50
  • I know nothing about mail and encoding, so naturally I jumped to the only solution available - skip it if it cannot be read. – Mark Colan Jun 17 '16 at 21:53

2 Answers


A UnicodeDecodeError doesn't mean that your mailbox is broken, only that it contains characters beyond the ASCII range. It sounds like you can afford to skip the broken messages (if you didn't have a problem with the first thirty thousand messages, there might not be more than a handful in the whole file), but wouldn't it be better to actually fix the problem?

According to the mailbox documentation, messages are read in binary format from the file; you are getting an error when the mailbox tries to convert them to Unicode, and assumes the ASCII encoding. So, try providing a "factory method" that does its own conversion, then delegates to the default class:

import mailbox

def mbox_reader(stream):
    """Read a non-ASCII message from the mailbox."""
    data = stream.read()  # raw bytes of a single message
    text = data.decode(encoding="utf-8")  # decode explicitly instead of assuming ASCII
    return mailbox.mboxMessage(text)

mbox = mailbox.mbox('INBOX', factory=mbox_reader)
for message in mbox:
    ...

Try this, and if you still get errors, change the encoding from "utf-8" to "latin-1", or whatever is likely to be the default for your Thunderbird. If it still doesn't work, you can still read the problem messages by telling python to replace unreadable characters with a special symbol:

text = data.decode(encoding="utf-8", errors="replace")

With this setting, instead of a UnicodeDecodeError you'll just get that funny question mark glyph (the Unicode replacement character) in the message.
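To make the choice of encoding less of a guessing game, here is a minimal sketch that tries several codecs on the same bytes and records which ones fail. The helper name `decode_report` and the sample bytes are made up for illustration; only `bytes.decode` and its `errors` argument come from Python itself.

```python
def decode_report(data, encodings=("ascii", "utf-8", "cp1252", "latin-1")):
    """Return {encoding: decoded text, or None if decoding raised}."""
    report = {}
    for name in encodings:
        try:
            report[name] = data.decode(name)
        except UnicodeDecodeError:
            report[name] = None
    return report


# Hypothetical sample: 0x92 is the curly apostrophe in Windows-1252.
sample = b"it\x92s"
report = decode_report(sample)

# errors="replace" never raises; each undecodable byte becomes U+FFFD.
lenient = sample.decode("utf-8", errors="replace")
```

On this sample, "ascii" and "utf-8" both fail, "cp1252" yields the curly apostrophe, and "latin-1" succeeds but maps the byte to an invisible control character. "latin-1" assigns a character to every byte value, so it cannot raise; Python's "cp1252" leaves a few bytes (such as 0x81) undefined, so it still can.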

alexis
  • I'll try it. Yes, always better to fix the problem rather than mask it, which was my approach. But I'm a Python newbie and the intricacies of email encoding are well outside of my experience. – Mark Colan Jun 17 '16 at 21:47
  • I would have assumed that the message would have a header stating which coding was used. Your code switches it to always use utf-8 instead of ascii. Is it likely that all messages in the file are one or the other, if not unicode? – Mark Colan Jun 17 '16 at 21:50
  • Good question, you may be right. I don't really know. Starting with my factory function, you could trap a UnicodeDecodeError (just stick to the "ascii" encoding) and print out the current message unparsed, to see what it really provides. I'm guessing you can see how to do that now. – alexis Jun 17 '16 at 21:53
  • File "D:\Dev\Sandbox2\Sandbox2.py", line 7, in check_each yield next(it) File "C:\Program Files\Python35\lib\mailbox.py", line 108, in itervalues value = self[key] File "C:\Program Files\Python35\lib\mailbox.py", line 75, in __getitem__ return self._factory(file) File "D:\Dev\Sandbox2\Sandbox2.py", line 17, in mbox_reader text = data.decode(encoding="utf-8") UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 8000: invalid start byte – Mark Colan Jun 17 '16 at 21:53
  • Ok, so it's not utf-8. Try `"latin-1"` or the error handler. (With latin 1 you should not get any errors, just the occasional garbage.) – alexis Jun 17 '16 at 21:55
  • How do I dump the data? Python function, or code-my-own? – Mark Colan Jun 17 '16 at 21:55
  • [0x92 is a smart quote of Windows-1252](http://stackoverflow.com/a/29419477/699305). So try `data.decode("cp1252")`. – alexis Jun 17 '16 at 21:58
  • To dump the data, you would add debugging code to `mbox_reader()`. Gotta go, sorry. Good luck. – alexis Jun 17 '16 at 22:00
  • "latin-1" allowed it to run to completion. Trying that now in my processing program. cp1252 barfed probably on the same message. – Mark Colan Jun 17 '16 at 23:32

The internal mechanics of a for loop are something like this:

#for message in mbox:
#    do_stuff()

it = iter(mbox)
try:
    while True:
        message = next(it)
        do_stuff()
except StopIteration:
    pass

so you could handle the iterator manually or use a generator to catch other exceptions when an error is raised:

import traceback
def check_each(iterable):
    it = iter(iterable)
    while True:
        try:
            yield next(it)
        except StopIteration:
            return
        except Exception as e:
            print("Exception was caught!")
            traceback.print_exc()
            continue #keep going

Then you can do `for message in check_each(mbox):` and it will show the full traceback each time an error happens without halting your program.
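One caveat, as far as I can tell: iterating an mbox is driven by the `itervalues` generator seen in the traceback, and a generator that lets an exception escape is finished. The next `next(it)` then raises StopIteration, so wrapping the iterator may end the loop at the first bad message rather than skip it. A minimal sketch that sidesteps this by iterating over the keys and loading each message inside the try block (the name `iter_readable` is hypothetical, and treating every exception as skippable is an assumption):

```python
import traceback


def iter_readable(box):
    """Yield (key, message) pairs, skipping messages that fail to load."""
    for key in box.iterkeys():
        try:
            # The lookup is what raised in the question's traceback.
            message = box[key]
        except Exception:
            # Assumption: any failure on a single message is skippable.
            traceback.print_exc()
            continue
        yield key, message
```

Usage would be `for key, message in iter_readable(mailbox.mbox('INBOX')):`. Because the key iterator itself never raises on a bad message, the loop moves on to the next key after a failure.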

Tadhg McDonald-Jensen