
I'd like to read a big 3GB .mbox file coming from a Gmail backup. This works:

import mailbox
mbox = mailbox.mbox(r"D:\All mail Including Spam and Trash.mbox")
for i, message in enumerate(mbox):
    print("from   :",message['from'])
    print("subject:",message['subject'])
    if message.is_multipart():
        # get_payload(decode=True) returns bytes, or None for nested multipart parts
        content = b''.join(part.get_payload(decode=True) or b'' for part in message.get_payload())
    else:
        content = message.get_payload(decode=True)
    print("content:",content)
    print("**************************************")

    if i == 10:
        break

except that it takes more than 40 seconds just for the first 10 messages.

Is there a faster way to access a big .mbox file with Python?

Basj
  • I *think* the `mailbox` library reads it all into memory. It should not be hard to rewrite a simple `mbox` parser as a generator (in brief, any line which starts with `From ` starts a new message). – tripleee Jan 10 '20 at 12:27
  • No, `for message in mailbox.mbox()` doesn't read it all into memory. It iterates over messages efficiently, one at a time, using a generator. But it does pre-populate a small internal TOC structure on first access, which can take time. – user124114 Jun 01 '22 at 17:13
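
To make the table-of-contents idea from the comment above concrete, here is a rough sketch of an offset index. It is not the standard library's actual code; `build_toc` and `read_message` are hypothetical names chosen for illustration. The point is that building the index requires one full scan of the file, which is the likely cost on first access, while reading a single message afterwards only needs a seek.

```python
import email
from email.policy import default


def build_toc(path):
    """Scan the whole file once; return a list of (start, stop) byte offsets.

    Hypothetical helper that only illustrates the idea of a table of
    contents. It is not the code used by the standard mailbox module.
    """
    starts = []
    with open(path, 'rb') as handle:
        while True:
            pos = handle.tell()
            line = handle.readline()
            if line == b'':
                end = pos  # end of file reached
                break
            if line.startswith(b'From '):
                starts.append(pos)
    # Each message stops where the next one starts; the last stops at EOF.
    stops = starts[1:] + [end]
    return list(zip(starts, stops))


def read_message(path, toc, index):
    """Seek straight to message `index` and parse only that message."""
    start, stop = toc[index]
    with open(path, 'rb') as handle:
        handle.seek(start)
        raw = handle.read(stop - start)
    # Drop the "From " separator line before parsing headers and body.
    _, _, rest = raw.partition(b'\n')
    return email.message_from_bytes(rest, policy=default)
```

With a 3GB file, `build_toc` has to read every line once, so a noticeable delay before the first message is plausible even though the messages themselves are not all held in memory.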

2 Answers


Here's a quick and dirty attempt to implement a generator to read in an mbox file message by message. I have opted to simply ditch the information from the From separator; I'm guessing maybe the real mailbox library might provide more information, and of course, this only supports reading, not searching or writing back to the input file.

#!/usr/bin/env python3

import email
from email.policy import default

class MboxReader:
    def __init__(self, filename):
        self.handle = open(filename, 'rb')
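        # Consume the first separator explicitly; an assert would be
        # skipped entirely (including the readline) under "python -O".
        first_line = self.handle.readline()
        if not first_line.startswith(b'From '):
            self.handle.close()
            raise ValueError('not an mbox file: missing "From " separator')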

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.handle.close()

    def __iter__(self):
        # Generator: collect lines until the next "From " separator
        # (or end of file), then yield them as one parsed message.
        lines = []
        while True:
            line = self.handle.readline()
            if line == b'' or line.startswith(b'From '):
                yield email.message_from_bytes(b''.join(lines), policy=default)
                if line == b'':
                    break
                lines = []
                continue
            lines.append(line)

Usage:

with MboxReader(mboxfilename) as mbox:
    for message in mbox:
        print(message.as_string())

The policy=default argument (or any other policy you prefer, of course) selects the modern EmailMessage API, which was introduced in Python 3.3 and became official in 3.6. If you need to support older Python versions from simpler times, you will want to omit it; but really, the new API is better in many ways.
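
As a small illustration of what the policy argument changes, the sketch below parses the same made-up message with and without policy=default. The sample message and the `plain_text_body` helper are assumptions added for this example. It relies on `get_body()` and `get_content()`, which are only available on the newer EmailMessage objects.

```python
import email
from email.policy import default

# Made-up sample message, only for illustration.
RAW = (
    b"From: Alice <alice@example.com>\n"
    b"Subject: =?utf-8?q?Hello_World?=\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"Hello=2C world\n"
)


def plain_text_body(message):
    """Hypothetical helper: return the text/plain body, or '' if there is none."""
    part = message.get_body(preferencelist=('plain',))
    return part.get_content() if part is not None else ''


legacy = email.message_from_bytes(RAW)                  # legacy Message API
modern = email.message_from_bytes(RAW, policy=default)  # EmailMessage API

# The legacy object returns the header undecoded; the modern one decodes it.
print(type(legacy).__name__, repr(legacy['subject']))
print(type(modern).__name__, repr(modern['subject']))
print(plain_text_body(modern))
```

For multipart messages, `get_body()` picks the preferred body part for you, which tends to be simpler than joining payloads by hand.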

tripleee
  • I had to google details about context managers and iterables so this might still have bugs. Feedback welcome. – tripleee Jan 10 '20 at 13:18
  • Thank you for your answer, it works! By the way, it seems that you answered https://stackoverflow.com/questions/59681576/show-the-full-original-source-of-an-email-with-python at the same time (isn't the answer `message.as_string()`? if so, feel free to post this command there) – Basj Jan 10 '20 at 14:03
  • PS @tripleee: `message.as_string()` sometimes generates errors for me: `KeyError: 'content-transfer-encoding'`. I wonder if this was handled specifically in the original `mailbox` module I was using? Here is the full traceback: https://pastebin.com/K2xKCSKG – Basj Jan 10 '20 at 14:04
  • This looks like a bug in the `email` module. Can you share (an approximation of) a message which causes this traceback? – tripleee Jan 10 '20 at 14:09
  • I'll take a deeper look at this and share it this weekend. Thanks again for your help! – Basj Jan 10 '20 at 14:10
  • If you only want the bare message source, without the `email` object encapsulation, you can drop the `import`s and just `yield(b''.join(lines))` – tripleee Jan 10 '20 at 14:17
  • I'm astonished this isn't in the mailbox standard library module. Most of the Python standard library is iterator-friendly. Thanks for providing it! – Jim Pivarski May 20 '21 at 19:43
  • @tripleee You should make a PR to include this in the Python stdlib / mailbox module! So much better than the original very slow mbox reader! – Basj Feb 01 '22 at 22:30
  • No, the built-in `mailbox.mbox` does NOT load the entire archive into RAM. But it does pre-load and cache a (small) TOC structure that maps each message position (int) to its file byte offset (two ints). This TOC can take time to create on first access. – user124114 Jun 01 '22 at 17:13
  • Can this answer extract attached files or their sizes? – jokoon Oct 01 '22 at 12:21
  • @jokoon On its own, no; but the `EmailMessage`s you get back have methods for this. This comment space is too small to give the topic a fuller treatment, but it should not be hard to find existing questions about how to do this in Python. Keep in mind, though, that many older answers will target the somewhat clunkier and less versatile `email.Message` legacy API; however, many of the same methods are also available in the newer one (but a better and more straightforward new method might also be available). – tripleee Oct 01 '22 at 15:18
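
To follow up on the attachment question in the comments, here is a minimal sketch using `iter_attachments()`, one of the methods available on `EmailMessage` objects. The `list_attachments` helper and the demo message are hypothetical and exist only for illustration. It assumes the message was parsed with policy=default.

```python
import email
from email.message import EmailMessage
from email.policy import default


def list_attachments(message):
    """Hypothetical helper: return (filename, size_in_bytes) per attachment."""
    result = []
    for part in message.iter_attachments():
        # decode=True undoes base64/quoted-printable; it returns None for
        # container parts, which are counted as empty here.
        payload = part.get_payload(decode=True) or b''
        result.append((part.get_filename(), len(payload)))
    return result


# Build a small message in memory just to try the helper out.
demo = EmailMessage()
demo['Subject'] = 'report'
demo.set_content('See attachment.')
demo.add_attachment(b'12345', maintype='application',
                    subtype='octet-stream', filename='data.bin')

parsed = email.message_from_bytes(demo.as_bytes(), policy=default)
print(list_attachments(parsed))
```

The same helper could be called on each message yielded by the reader; the filename may be None when a part does not declare one.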

Using the MboxReader class from tripleee's answer, you can look up any header key to get specific information from each message. You can then write it to a text file for further analysis of your mailbox.

from tqdm import tqdm  # optional, only used for the progress display

path = "your_gmail.mbox"

# Open the reader in a with block so the file handle gets closed.
with MboxReader(path) as mbox, open('Output.txt', 'w', encoding="utf-8") as file:
    for idx, message in tqdm(enumerate(mbox)):
        # print(message.keys())
        mail_from = f"{message['From']}\n".replace('"', '')
        file.write(mail_from)
        print(idx, message['From'])

For reference, header keys like the following can be used:

['X-GM-THRID', 'X-Gmail-Labels', 'Delivered-To', 'Received', 'X-Received',
 'ARC-Seal', 'ARC-Message-Signature', 'ARC-Authentication-Results',
 'Return-Path', 'Received', 'Received-SPF', 'Authentication-Results',
 'DKIM-Signature', 'X-Google-DKIM-Signature', 'X-Gm-Message-State',
 'X-Google-Smtp-Source', 'MIME-Version', 'X-Received', 'Date', 'Reply-To',
 'X-Google-Id', 'Precedence', 'List-Unsubscribe', 'Feedback-ID', 'List-Id',
 'X-Notifications', 'X-Notifications-Bounce-Info', 'Message-ID', 'Subject',
 'From', 'To', 'Content-Type']

Hope it was useful :)

Vinay Verma
  • Tangentially, there is no "above"; the answers on this page will be ordered according to each visitor's preference (the default is to sort by score, in which case the other answer will currently indeed be sorted above this one). – tripleee Jul 29 '22 at 09:56
  • Thanks for the answer, but I would rather not use tqdm, it's not really necessary, you can use native `sys.stdout.write` and `sys.stdout.write` instead – jokoon Oct 01 '22 at 09:11
  • Sorry I meant `sys.stdout.write()` and `sys.stdout.flush()` – jokoon Oct 01 '22 at 09:27
  • You can remove the tqdm, it was just to check how many mails are remaining. An mbox file can be pretty big. – Vinay Verma Dec 09 '22 at 09:38
  • The random splattering of extracted header names is not particularly informative or useful; many of these headers are nonstandard or optional, and some optional but standard headers are missing. – tripleee May 04 '23 at 13:35
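
For readers who prefer to avoid the tqdm dependency, as suggested in the comments, a progress counter can be sketched with plain `write()` and `flush()` calls. The `with_progress` name and its `every` and `stream` parameters are made up for this example.

```python
import sys


def with_progress(iterable, every=100, stream=sys.stdout):
    """Yield items unchanged, rewriting a counter line every `every` items."""
    count = 0
    for count, item in enumerate(iterable, start=1):
        if count % every == 0:
            # "\r" returns to the start of the line so the counter is
            # overwritten in place instead of scrolling.
            stream.write(f"\r{count} messages read")
            stream.flush()
        yield item
    # Final total, followed by a newline so later output starts cleanly.
    stream.write(f"\r{count} messages read\n")
    stream.flush()
```

It would be used as `for message in with_progress(mbox):`. Unlike tqdm with a known total, it cannot show how many messages remain, because the reader does not know the count in advance.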