13

I'm writing a Python script to process emails returned from Procmail. As suggested in this question, I'm using the following Procmail config:

:0:
|$HOME/process_mail.py

My process_mail.py script is receiving an email via stdin like this:

From hostname Tue Jun 15 21:43:30 2010
Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400
Received: from mail-fx0-f44.google.com (209.85.161.44)
by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400
Received: by fxm19 with SMTP id 19so170709fxm.3
for <username@domain.com>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15
Jun 2010 18:47:33 -0700 (PDT)
Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
Date: Tue, 15 Jun 2010 20:47:33 -0500
Message-ID: <AANLkTikFsIjJ3KYW1HJWcAqQlGXNiXE2YMzrj39I0tdB@mail.gmail.com>
Subject: TEST 12
From: Full Name <username@sender.com>
To: username@domain.com
Content-Type: text/plain; charset=ISO-8859-1

ONE
TWO
THREE

I'm trying to parse the message in this way:

>>> import email
>>> msg = email.message_from_string(full_message)

I want to get message fields like 'From', 'To' and 'Subject'. However, the message object does not contain any of these fields.

What am I doing wrong?

Manuel Ceron

3 Answers

10

Make sure the header lines are not accidentally broken, as they are in the message above (though that may just be a copy-paste problem). With an intact message such as:

Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400
Received: from mail-fx0-f44.google.com (209.85.161.44) by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400
Received: by fxm19 with SMTP id 19so170709fxm.3 for <username@domain.com>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
Date: Tue, 15 Jun 2010 20:47:33 -0500
Message-ID: <AANLkTikFsIjJ3KYW1HJWcAqQlGXNiXE2YMzrj39I0tdB@mail.gmail.com>
Subject: TEST 12
From: Full Name <username@sender.com>
To: username@domain.com
Content-Type: text/plain; charset=ISO-8859-1

ONE
TWO
THREE

then

import email

msg = email.message_from_string(msgtxt)  # msgtxt holds the intact message above
print(msg['Subject'])

prints TEST 12 as desired.
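
For the Procmail case, where the message arrives on stdin, the same lookup could be sketched roughly as below. This assumes Python 3; `read_fields` is a hypothetical helper name, and the in-memory stream only stands in for `sys.stdin.buffer` so the example runs on its own.

```python
import email
import io


def read_fields(stream, names=("From", "To", "Subject")):
    """Parse one message from a binary stream and return the requested headers."""
    msg = email.message_from_binary_file(stream)
    # msg[name] is None when a header is missing, so a broken message is easy to spot.
    return {name: msg[name] for name in names}


# Hypothetical sample input; in process_mail.py the stream would be sys.stdin.buffer.
sample = io.BytesIO(
    b"From: Full Name <username@sender.com>\n"
    b"To: username@domain.com\n"
    b"Subject: TEST 12\n"
    b"\n"
    b"ONE\nTWO\nTHREE\n"
)
fields = read_fields(sample)
print(fields["Subject"])
```

Reading bytes rather than text leaves charset handling to the parser, which is usually the safer choice for mail piped in from an MTA.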

Alex Martelli
  • How to get the body of the email here? – Anuj Feb 10 '14 at 13:09
  • If you truly want the entire RFC2822 email body with raw MIME structures and all, parsing the message in Python is basically superfluous; the body is everything after the first empty line. Normally, with modern messages, you want to parse the MIME structure and extract one or more body parts. – tripleee Jun 22 '16 at 09:35
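
One way to extract a text body part, as the comment above describes, might look like this. It is only a sketch: it assumes Python 3.6 or later and the modern `email.policy.default` API, and the sample message is a cut-down version of the one in the question.

```python
import email
from email import policy

# Hypothetical sample; a real script would parse the message received from Procmail.
raw = (
    "Subject: TEST 12\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "ONE\nTWO\nTHREE\n"
)

# policy.default makes the parser return an EmailMessage, which provides get_body().
msg = email.message_from_string(raw, policy=policy.default)

# Ask for the plain-text part; get_body() returns None if there is no such part.
part = msg.get_body(preferencelist=("plain",))
text = part.get_content() if part is not None else ""
print(text)
```

For multipart messages the same call picks the best matching part, which saves walking the MIME tree by hand.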
5

It looks like you have line breaks without whitespace at the start of the continuation lines, which according to RFC 2822 §2.2.3 is illegal:

Each header field is logically a single line of characters comprising
the field name, the colon, and the field body. For convenience
however, and to deal with the 998/78 character limitations per line,
the field body portion of a header field can be split into a multiple
line representation; this is called "folding". The general rule is
that wherever this standard allows for folding white space (not
simply WSP characters), a CRLF may be inserted before any WSP. For
example, the header field:

    Subject: This is a test

can be represented as:

    Subject: This
     is a test

It should look something like this:

From hostname Tue Jun 15 21:43:30 2010
Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400
Received: from mail-fx0-f44.google.com (209.85.161.44)
    by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400
Received: by fxm19 with SMTP id 19so170709fxm.3
    for <username@domain.com>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15
    Jun 2010 18:47:33 -0700 (PDT)
Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
Date: Tue, 15 Jun 2010 20:47:33 -0500
Message-ID: <AANLkTikFsIjJ3KYW1HJWcAqQlGXNiXE2YMzrj39I0tdB@mail.gmail.com>
Subject: TEST 12
From: Full Name <username@sender.com>
To: username@domain.com
Content-Type: text/plain; charset=ISO-8859-1

ONE
TWO
THREE
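
The effect of the missing whitespace can be reproduced with a small sketch like the one below. It assumes Python 3, and the host names and addresses are made-up placeholders. A continuation line that does not start with a space or tab is not recognized as a header line, so the parser stops reading headers there and treats everything after it as body.

```python
import email

# Correctly folded: the continuation line starts with whitespace.
FOLDED = (
    "Received: from mail.example.com (192.0.2.1)\n"
    "    by mx.example.org with SMTP; 15 Jun 2010 21:43:22 -0400\n"
    "Subject: TEST 12\n"
    "\n"
    "ONE\n"
)
# Same message with the leading whitespace stripped from the continuation line.
BROKEN = FOLDED.replace("\n    by", "\nby")

good = email.message_from_string(FOLDED)
bad = email.message_from_string(BROKEN)

print(good["Subject"])    # TEST 12
print(bad["Subject"])     # None
print(bad.get_payload())  # the "lost" headers end up inside the body
```

If `msg['Subject']` comes back as `None`, looking at `msg.get_payload()` is a quick way to check whether the headers were swallowed into the body like this.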
Michael Mrozek
  • So just to clarify, if the raw file says `Subject: This\r\n is a test`, then `email.message_from_string()` *should* say the subject is `This is a test` (no whitespace). I'm finding that for a particular email with such wrapping for attachment name (`Content-Disposition`), the funny `\r\n` is not stripped. – falsePockets Apr 21 '20 at 06:34
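
The behaviour described in this comment may come down to the parser policy. As a sketch, assuming Python 3: the default `compat32` policy returns a folded header value with the line break still in it, while `email.policy.default` unfolds it. Only the `Subject` case is shown here, not `Content-Disposition` parameters.

```python
import email
from email import policy

raw = "Subject: This\n is a test\n\nbody\n"

# Default (compat32) policy: the fold is kept inside the returned value.
legacy = email.message_from_string(raw)
# Modern policy: the header value is unfolded when it is fetched.
modern = email.message_from_string(raw, policy=policy.default)

print(repr(legacy["Subject"]))       # 'This\n is a test'
print(repr(str(modern["Subject"])))  # 'This is a test'
```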
2

Answering my own question:

I found a bug in the code that builds the messages. It was inserting line breaks between some lines, which prevented the parser from working properly.

Manuel Ceron