Parsing "From:" field of an e-mail message in Python

Question

I am trying to parse an RFC 5322 compliant "From: " field in an e-mail message into two parts: the display-name, and the e-mail address, in Python 2.7 (the display-name could be empty). The familiar example is something like

John Smith <jsmith@example.org>

In the example above, John Smith is the display-name and jsmith@example.org is the e-mail address. But the following is also a valid "From: " field:

"unusual" <"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com>

In this example, the return value for display-name is

"unusual"

and

"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com

is the email address.

You can use grammars to parse this in Perl (as explained in these questions: Using a regular expression to validate an email address and The recognizing power of “modern” regexes), but I'd like to do this in Python 2.7. I have tried the email.parser module, but it only seems able to split a message into its header fields (the parts delimited by a colon), not to break a field's value down any further. So, if you do something like

from email.parser import Parser
headers = Parser().parsestr('From: "John Smith" <jsmith@example.org>')
print headers['from']

it will return

"John Smith" <jsmith@example.org>

while if you replace the last line in the above code with

print headers['display-name']

it will return

None

I'll very much appreciate any suggestions and comments.

I'd suggest getting it to work? You need to give more information about the problem before anyone can give more specific help. — alexis, Oct 06 '13 at 23:11
The `headers['display-name']` does not make sense. The display-name is not a field of the header, but of the 1st email address in the From: ... header. — Alexis Wilke, Oct 06 '13 at 23:54

Robᵩ · Accepted Answer · 2013-10-06T23:57:32.707

headers['display-name'] is not part of the email.parser API.

Try email.utils.parseaddr:

In [17]: email.utils.parseaddr("jsmith@example.com")
Out[17]: ('', 'jsmith@example.com')

In [18]: email.utils.parseaddr("(John Smith) jsmith@example.com")
Out[18]: ('John Smith', 'jsmith@example.com')

In [19]: email.utils.parseaddr("John Smith <jsmith@example.com>")
Out[19]: ('John Smith', 'jsmith@example.com')

It also handles your unusual address:

In [21]: email.utils.parseaddr('''"unusual" <"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com>''')
Out[21]: ('unusual', '"very.(),:;<>[]".VERY."very@ "very".unusual"@strange.example.com')
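
To tie this back to the code in the question, here is a minimal sketch that feeds the value returned by email.parser into parseaddr. The helper name split_from_header is made up for illustration, and the `or ''` fallback for a missing From: header is an assumption about what you would want in that case.

```python
from email.parser import Parser
from email.utils import parseaddr


def split_from_header(raw_message):
    """Return (display_name, address) taken from the From: header.

    split_from_header is a hypothetical helper, not part of the email package.
    """
    # headersonly=True stops parsing at the end of the header block.
    headers = Parser().parsestr(raw_message, headersonly=True)
    # headers['from'] is None when the header is absent; parseaddr
    # expects a string, so fall back to '' (assumed behaviour).
    return parseaddr(headers['from'] or '')


name, address = split_from_header('From: "John Smith" <jsmith@example.org>\n\nbody')
# name == 'John Smith', address == 'jsmith@example.org'
```

Note that parseaddr only looks at a single address. If the field can contain several addresses, email.utils.getaddresses is probably the better fit.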

Thanks! This is perfect! It's exactly what I was looking for. — user765195, Oct 07 '13 at 00:05

score 1 · Answer 2 · answered Oct 06 '13 at 23:52

I wrote such a parser in C++ for libtld. If you really want to be complete, you could build one with lex and yacc (although I do not use those tools myself). My C++ code may help you write your own version in Python.

(lex part)
[-A-Za-z0-9!#$%&'*+/=?^_`{|}~]+                                          atom_text_repeat (ALPHA+DIGIT+some other characters)
([\x09\x0A\x0D\x20-\x27\x2A-\x5B\x5D-\x7E]|\\[\x09\x20-\x7E])+           comment_text_repeat
([\x21-\x5A\x5E-\x7E])+                                                  domain_text_repeat
([\x21\x23-\x5B\x5D-\x7E]|\\[\x09\x20-\x7E])+                            quoted_text_repeat
\x22                                                                     DQUOTE
[\x20\x09]*\x0D\x0A[\x20\x09]+                                           FWS
.                                                                        any other character

(lex definitions merged in more complex lex definitions)
[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]                                         NO_WS_CTL
[()<>[\]:;@\\,.]                                                         specials
[\x01-\x09\x0B\x0C\x0E-\x7F]                                             text
\\[\x09\x20-\x7E]                                                        quoted_pair ('\\' text)
[A-Za-z]                                                                 ALPHA
[0-9]                                                                    DIGIT
[\x20\x09]                                                               WSP
\x20                                                                     SP
\x09                                                                     HTAB
\x0D\x0A                                                                 CRLF
\x0D                                                                     CR
\x0A                                                                     LF

(yacc part)
address_list: address
            | address ',' address_list
address: mailbox
       | group
mailbox_list: mailbox
            | mailbox ',' mailbox_list
mailbox: name_addr
       | addr_spec
group: display_name ':' mailbox_list ';' CFWS
     | display_name ':' CFWS ';' CFWS
name_addr: angle_addr
         | display_name angle_addr
display_name: phrase
angle_addr: CFWS '<' addr_spec '>' CFWS
addr_spec: local_part '@' domain
local_part: dot_atom
          | quoted_string
domain: dot_atom
      | domain_literal
domain_literal: CFWS '[' FWS domain_text_repeat FWS ']' CFWS
phrase: word
      | word phrase
word: atom
    | quoted_string
atom: CFWS atom_text_repeat CFWS
dot_atom: CFWS dot_atom_text CFWS
dot_atom_text: atom_text_repeat
             | atom_text_repeat '.' dot_atom_text
quoted_string: CFWS DQUOTE quoted_text_repeat DQUOTE CFWS
CFWS: <empty>
    | FWS comment
    | CFWS comment FWS
comment: '(' comment_content ')'
comment_content: comment_text_repeat
               | comment
               | comment_content comment_content
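
As a rough idea of how the lex rules could carry over to Python, here is a small sketch that uses the re module for the atom_text_repeat and dot_atom_text rules only. It ignores CFWS, quoted strings and everything else in the grammar, and the names DOT_ATOM_RE and is_dot_atom_text are made up for this example, so treat it as a starting point rather than a parser.

```python
import re

# atom_text_repeat, copied from the lex rule above.
ATOM_TEXT = r"[-A-Za-z0-9!#$%&'*+/=?^_`{|}~]+"

# dot_atom_text: atom_text_repeat
#              | atom_text_repeat '.' dot_atom_text
# The right recursion is written here as a repetition.
DOT_ATOM_RE = re.compile(ATOM_TEXT + r"(?:\." + ATOM_TEXT + r")*\Z")


def is_dot_atom_text(text):
    """Return True if text matches dot_atom_text (no CFWS handling)."""
    return DOT_ATOM_RE.match(text) is not None


print(is_dot_atom_text("john.smith"))   # True
print(is_dot_atom_text("john..smith"))  # False: empty atom between the dots
```

The remaining terminals can be translated the same way, but the recursive rules (nested comments in particular) need a real parser on top of the regular expressions.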

Ah! It wasn't clear in the question that you did not want to write the actual parser. 8-) — Alexis Wilke, Oct 09 '13 at 00:51
