
I'm trying to extract attachments from emails over IMAP using Python 3.7 (on Windows, FYI). Every attempt produces files whose modification and creation dates equal the time of extraction, which is incorrect.
Since full email applications can preserve that information, it must be stored somewhere. I also tried working with structs, thinking the information might be stored in binary, but had no luck.

import email
from email.header import decode_header
import imaplib
import os

SERVER = None
OUT_DIR = '/var/out'
IMP_SRV = 'mail.domain.tld'
IMP_USR = 'user@domain.tld'
IMP_PWD = 'hunter2'

def login_mail():
    global SERVER
    SERVER = imaplib.IMAP4_SSL(IMP_SRV)
    SERVER.login(IMP_USR, IMP_PWD)


def get_mail(folder='INBOX'):
    mails = []
    # A mailbox has to be selected before SEARCH/FETCH will work
    SERVER.select(folder)
    _, data = SERVER.uid('SEARCH', 'ALL')
    uids = data[0].split()

    for uid in uids:
        _, s = SERVER.uid('FETCH', uid, '(RFC822)')
        mail = email.message_from_bytes(s[0][1])
        mails.append(mail)

    return mails


def parse_attachments(mail):
    for part in mail.walk():
        if part.get_content_type() == 'application/octet-stream':
            filename = get_filename(part)
            output = os.path.join(OUT_DIR, filename)
            with open(output, 'wb') as f:
                f.write(part.get_payload(decode=True))

def get_filename(part):
    filename = part.get_filename()
    # Decode RFC 2047 encoded-word filenames (e.g. =?utf-8?...?=) into text
    decoded, charset = decode_header(filename)[0]
    if charset is not None:
        filename = decoded.decode(charset)
    filename = os.path.basename(filename)

    return filename

Can anyone tell me what I'm doing wrong, and whether this is possible at all?
Once that information is available, the timestamps could be modified using the approach from "How do I change the file creation date of a Windows file?".

schlumpfpirat
  • You can use any of the following as time: the `Date` header from the message, the `INTERNALDATE` fetch item, or the `Received` header date. I would use `INTERNALDATE`, it is usually the time your server received the message and first stored it. – Max Apr 14 '20 at 16:25
  • That approach would be quick, it would however lead to falsifying documents e.g. if someone resends an attachment from a year ago. Ordinary email applications pull that information from somewhere too, so it must be there. For example some attachments have an attribute stating the modification date, e.g. `modification-date="Mon, 6 Apr 2020 08:17:00 +0000"` – this doesn't seem to be accessible using `email` however. – schlumpfpirat Apr 14 '20 at 17:11
  • That's what the INTERNALDATE fetch item is for. If you fetch that item, it's set by the server when it receives the message. That's the most reliable date there is, and isn't set by the sender. It's not part of the message. It's like the FLAGS. – Max Apr 14 '20 at 21:03
  • I solved this issue by either taking the email's creation date or extracting the PDF-embedded creation date, and then modifying the file afterwards. Thanks so much, this was the right approach! – schlumpfpirat May 06 '20 at 09:31
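
A minimal sketch of the `INTERNALDATE` route suggested in the comments. It assumes the server returns the item in the non-literal part of the FETCH response. `internaldate_epoch` is a hypothetical helper name, not part of the question's code; `imaplib.Internaldate2tuple` is the standard-library parser.

```python
import imaplib
import time


def internaldate_epoch(fetch_data):
    """Return the INTERNALDATE of a FETCH response as epoch seconds, or None."""
    for item in fetch_data:
        # Items carrying a literal (the message body) are (header, body) tuples;
        # only the header half can hold the INTERNALDATE item.
        header = item[0] if isinstance(item, tuple) else item
        if isinstance(header, bytes) and b'INTERNALDATE' in header:
            parsed = imaplib.Internaldate2tuple(header)
            if parsed is not None:
                # Internaldate2tuple returns local time, so mktime converts it back.
                return time.mktime(parsed)
    return None


# Sample response in the shape imaplib returns it (values are made up).
sample = [
    (b'1 (UID 7 INTERNALDATE "14-Apr-2020 16:25:00 +0000" RFC822 {5}', b'hello'),
    b')',
]
epoch = internaldate_epoch(sample)
```

With the question's `SERVER` object this could be fed from `_, data = SERVER.uid('FETCH', uid, '(INTERNALDATE RFC822)')`, and the result passed to `os.utime` after the attachment is written. Keep in mind the objection raised above: this is the delivery time of the message, not the file's own modification time.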

1 Answer


I was able to extract the `creation-date` and `modification-date` parameters from the Content-Disposition header. Setting the file's modified date is simple too.

attachment_creation_date = attachment.get_param('creation-date', None, 'content-disposition')
attachment_modification_date = attachment.get_param('modification-date', None, 'content-disposition')

Here's a more complete example that shows how to read these parameters if present:

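import email.utils
import os


# msg must be an email.message.EmailMessage (parse with policy=email.policy.default);
# iter_attachments() is not available on the legacy Message class.
def process_email_attachments(msg, output_directory):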
    for attachment in msg.iter_attachments():
        try:
            output_filename = attachment.get_filename()
        except AttributeError:
            print("Couldn't get attachment filename. Skipping.")
            continue

        # Skip parts that have no filename
        if output_filename:
            attachment_creation_date = attachment.get_param('creation-date', None, 'content-disposition')
            attachment_modification_date = attachment.get_param('modification-date', None, 'content-disposition')
            try:
                output_file_full_path = os.path.join(output_directory, output_filename)
                with open(output_file_full_path, "wb") as of:
                    payload = attachment.get_payload(decode=True)
                    of.write(payload)

                if attachment_modification_date is not None:
                    attachment_modification_datetime = email.utils.parsedate_to_datetime(attachment_modification_date)
                    set_file_last_modified(output_file_full_path, attachment_modification_datetime)
            except TypeError:
                print("Couldn't get payload for %s" % output_filename)


def set_file_last_modified(file_path, dt):
    dt_epoch = dt.timestamp()
    os.utime(file_path, (dt_epoch, dt_epoch))
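
Not every attachment carries these parameters, so a fallback may be useful. The sketch below tries `modification-date`, then `creation-date`, then the message's `Date` header, and applies the result with `os.utime`. The helper name `pick_attachment_timestamp`, the fallback order, and the sample message are assumptions for illustration only.

```python
import email
import email.policy
import os
import tempfile
from email.utils import parsedate_to_datetime


def pick_attachment_timestamp(msg, part):
    """Return an epoch timestamp for `part`, or None if no usable date is found."""
    candidates = (
        part.get_param('modification-date', None, 'content-disposition'),
        part.get_param('creation-date', None, 'content-disposition'),
        msg.get('Date'),  # fallback: when the message was written
    )
    for raw in candidates:
        if not raw:
            continue
        try:
            return parsedate_to_datetime(str(raw)).timestamp()
        except (TypeError, ValueError):
            continue  # unparsable date string, try the next candidate
    return None


# Made-up message with one attachment that carries a modification-date.
RAW = (
    'Date: Tue, 14 Apr 2020 16:25:00 +0000\n'
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/mixed; boundary="XX"\n'
    '\n'
    '--XX\n'
    'Content-Type: text/plain\n'
    '\n'
    'see attachment\n'
    '--XX\n'
    'Content-Type: application/octet-stream\n'
    'Content-Transfer-Encoding: base64\n'
    'Content-Disposition: attachment; filename="report.bin";\n'
    ' modification-date="Mon, 6 Apr 2020 08:17:00 +0000"\n'
    '\n'
    'ZHVtbXk=\n'
    '--XX--\n'
)

msg = email.message_from_string(RAW, policy=email.policy.default)
part = next(msg.iter_attachments())
ts = pick_attachment_timestamp(msg, part)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, os.path.basename(part.get_filename()))
    with open(path, 'wb') as fh:
        fh.write(part.get_payload(decode=True))
    if ts is not None:
        os.utime(path, (ts, ts))  # sets access and modification time only
    restored_mtime = os.path.getmtime(path)
```

Falling back to the `Date` header has the same caveat raised in the comments: it reflects when the message was sent, not when the file was last changed.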

The second part of your question is how to set the file created date. This is platform dependent. There is already a separate question with answers demonstrating how to set the creation date on a Windows file: How do I change the file creation date of a Windows file?

Jon