Python newbie here. I want to walk through a large mbox file, parsing email messages. I can do that with:

import sys
import mailbox

def gen_summary(filename):
    # Iterate over every message in the mbox and print its subject
    mbox = mailbox.mbox(filename)
    for message in mbox:
        subj = message['subject']
        print(subj)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print('Usage: python genarchivesum.py mbox')
        sys.exit(1)

    gen_summary(sys.argv[1])

But I need more control. I need the byte offset at which a given email starts in the mbox file, and the number of bytes the message occupies on disk. Later, instead of iterating from the beginning of the mbox file, I want to seek straight to a given message and parse only that one, which is why I need the byte offset. These are large mbox files, so efficiency is a concern.

The purpose of all this is so that I can generate a summary file, which contains some small bits about each email in the mbox, and then in the future efficiently look up individual emails within the mbox.

Mark Fletcher
  • I've never used `mailbox`, but I just read `help(mailbox.mbox)`. Can't you use the `.iterkeys()` method to get an iterator of key values, and then use the key values to find messages? Why do you want to use a byte index as a key to find a message instead of using the module... have you tried using the module to index messages by key? If you've tried it and it's too slow or something, please say so. – steveha Apr 20 '12 at 19:05
  • Say I've got an mbox of 10,000 emails. I don't want to have to read in/parse/iterate over 9,998 of them when I just want the last email. I'd like to seek to that point in the mbox file and just read that message. – Mark Fletcher Apr 20 '12 at 19:43

1 Answer

I haven't tested this, but something like this might work for you. Just open the file (in binary mode so your byte counts are correct), and scan through it, finding messages.

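def is_mail_start(line):
    # An mbox message begins with a "From " separator line (space, no colon).
    # The file is opened in binary mode, so compare against a bytes literal.
    return line.startswith(b"From ")

def build_index(fname):
    """Yield (index, length) in bytes for each message in the mbox file."""
    with open(fname, "rb") as f:
        pos = 0       # byte offset of the line currently being read
        start = None  # byte offset of the current message; None before the first one
        for line in f:
            if is_mail_start(line):
                if start is not None:
                    # a new message begins, so yield up (index, length) of previous message
                    yield (start, pos - start)
                start = pos
            pos += len(line)
        if start is not None:
            yield (start, pos - start)  # yield up (index, length) of last message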

# get index as a list
mbox_index = list(build_index(fname))

Once you have the index, you can use the .seek() method on a file object to seek there, and .read(length) on the file object to read just one message. I'm not sure how you will use the mailbox module with a string, though; I think it is meant to work on a mailbox in-place. Maybe there is some other mail-parsing module you can use.
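
As a rough sketch of the seek-and-read step (not tested against real mailboxes, and assuming Python 3, where `email.message_from_bytes` is available), something like the following might work. The `read_message` helper and the two-message demo mailbox are made up for illustration; in practice the offset and length would come from an index of (index, length) pairs like the one built above.

```python
import email
import os
import tempfile

def read_message(fname, offset, length):
    """Seek to `offset` in the mbox file and parse `length` bytes as one message."""
    with open(fname, "rb") as f:
        f.seek(offset)
        raw = f.read(length)
    # Assumption: the slice starts at a "From " separator line. That line is
    # mbox framing, not a message header, so drop it before parsing.
    _separator, _, rest = raw.partition(b"\n")
    return email.message_from_bytes(rest)

# Hypothetical demo data: a tiny two-message mbox written to a temp file.
first = b"From alice@example.com Fri Apr 20 19:05:00 2012\nSubject: first\n\nhello\n\n"
second = b"From bob@example.com Fri Apr 20 19:43:00 2012\nSubject: second\n\nworld\n"
with tempfile.NamedTemporaryFile(suffix=".mbox", delete=False) as tmp:
    tmp.write(first + second)
    demo_path = tmp.name

# The second message starts right after the first, so its offset is len(first).
msg = read_message(demo_path, len(first), len(second))
demo_subject = msg["subject"]
os.remove(demo_path)
print(demo_subject)
```

Only the requested slice is read from disk, so the cost of fetching one message does not depend on how many messages come before it.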

steveha
  • Ok, thanks. I guess I'll use something like this strategy. By the way, an email in an mbox starts with 'From ' (without the ':'). I can use email.Parser to parse the email. Thanks. – Mark Fletcher Apr 20 '12 at 22:18
  • I'll edit the answer to take out the ':'. I *did* say I didn't test it... Good luck with your project, and have a great weekend! – steveha Apr 20 '12 at 22:31
  • For what it's worth, for future users, it's actually both, at least on the latest version of OS X: `def is_mail_start(line): return line.startswith("From") and not line.startswith("From:")` – adammenges Jun 12 '15 at 18:43
  • If the `From` that marks the start is always followed by a space, you could just search for the string `"From "` (note the space at the end). This wouldn't match `From:` with a colon. – steveha Jun 13 '15 at 05:34
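
To illustrate the point made in the comments above, here is a small check of how a `"From "` prefix test treats a few kinds of lines. The sample lines are invented for the demonstration. It also assumes the mailbox was written by a tool that escapes body lines beginning with `From ` as `>From `, which is common but not guaranteed for every mbox variant; without that escaping, a body line could be mistaken for a separator.

```python
def is_mail_start(line):
    """Return True for an mbox separator line, which starts with b'From ' (with the space)."""
    return line.startswith(b"From ")

# Hypothetical sample lines, as bytes because the file is read in binary mode.
samples = [
    b"From alice@example.com Fri Apr 20 19:05:00 2012\n",  # mbox separator line
    b"From: Alice <alice@example.com>\n",                   # ordinary From: header
    b">From the start, this was quoted\n",                  # escaped body line
    b"Subject: From here on\n",                             # "From" not at line start
]
results = [is_mail_start(s) for s in samples]
print(results)
```

Only the first sample is treated as the start of a message; the `From:` header is rejected because the sixth character is a colon rather than a space.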