I am trying to parse an RFC 5322 compliant "From: " field in an e-mail message into two parts: the display-name, and the e-mail address, in Python 2.7 (the display-name could be empty). The familiar example is something like

John Smith <jsmith@example.org>

In the example above, John Smith is the display-name and jsmith@example.org is the e-mail address. But the following is also a valid "From: " field:

"unusual" <"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com>

In this example, the return value for display-name is

"unusual" 

and

"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com

is the email address.

This can be parsed with grammars in Perl (as explained in these questions: Using a regular expression to validate an email address and The recognizing power of “modern” regexes), but I'd like to do it in Python 2.7. I have tried the email.parser module, but it only seems able to split a message into header fields, that is, the parts introduced by a name and a colon. So, if you do something like

from email.parser import Parser
headers = Parser().parsestr('From: "John Smith" <jsmith@example.org>')
print headers['from'] 

it will return

"John Smith" <jsmith@example.org>

while if you replace the last line in the above code with

print headers['display-name']

it will return

None

I'll very much appreciate any suggestions and comments.

user765195
  • I'd suggest getting it to work? You need to give more information about the problem before anyone can give more specific help. – alexis Oct 06 '13 at 23:11
  • Thanks. You're right. I'll try to clarify. – user765195 Oct 06 '13 at 23:13
  • The `headers['display-name']` does not make sense. The display-name is not a field of the header, but of the 1st email address in the From: ... header. – Alexis Wilke Oct 06 '13 at 23:54

2 Answers

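headers['display-name'] is not part of the email.parser API: the parser only splits a message into header fields, and display-name is not a header field but a component of the address inside the From: field.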

Try email.utils.parseaddr:

In [16]: import email.utils

In [17]: email.utils.parseaddr("jsmith@example.com")
Out[17]: ('', 'jsmith@example.com')

In [18]: email.utils.parseaddr("(John Smith) jsmith@example.com")
Out[18]: ('John Smith', 'jsmith@example.com')

In [19]: email.utils.parseaddr("John Smith <jsmith@example.com>")
Out[19]: ('John Smith', 'jsmith@example.com')
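
To tie this back to the code in the question, the two pieces can be combined: let email.parser extract the raw From: value, then hand that string to parseaddr. This is only a minimal sketch. The helper name split_from is made up for illustration and is not part of the email package, and the code is written so that it should run unchanged on Python 2.7 and Python 3.

```python
from email.parser import Parser
from email.utils import parseaddr


def split_from(raw_message):
    """Return (display_name, address) for the From: header of raw_message.

    split_from is a hypothetical helper name, not part of the email package.
    """
    # headersonly=True stops parsing once the header block ends.
    headers = Parser().parsestr(raw_message, headersonly=True)
    # headers['from'] is None when the header is missing; parseaddr wants a string.
    return parseaddr(headers['from'] or '')


name, address = split_from('From: "John Smith" <jsmith@example.org>')
print(name)     # John Smith
print(address)  # jsmith@example.org
```

When the message has no From: header at all, this sketch returns a pair of empty strings, which is also what parseaddr returns when it cannot parse its input.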

It also handles your unusual address:

In [21]: email.utils.parseaddr('''"unusual" <"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com>''')
Out[21]: ('unusual', '"very.(),:;<>[]".VERY."very@ "very".unusual"@strange.example.com')
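
One caveat about the literal in In [21]: it is an ordinary (non-raw) triple-quoted string, so Python itself consumes the backslash escapes before parseaddr ever sees them. That is likely why Out[21] contains no backslashes. To pass the header exactly as it is written in the question, a raw string is needed. The sketch below only shows the difference between the two literals; the variable names are for illustration, and the parsed result of the raw form is printed without an expected value on purpose.

```python
from email.utils import parseaddr

# Ordinary literal: \" becomes " and \\ becomes \ before parsing starts.
cooked = '''"unusual" <"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com>'''

# Raw literal: every backslash from the question is kept as-is.
raw = r'''"unusual" <"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com>'''

print(cooked.count('\\'))  # 1
print(raw.count('\\'))     # 6

# What parseaddr makes of the header as it would appear on the wire.
print(parseaddr(raw))
```
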
Robᵩ

I wrote such a parser in C++ as part of libtld. If you really want to be complete, the lex and yacc definitions are below (although I do not use those tools myself). My C++ code may help you write your own version in Python.

(lex part)
[-A-Za-z0-9!#$%&'*+/=?^_`{|}~]+                                          atom_text_repeat (ALPHA+DIGIT+some other characters)
([\x09\x0A\x0D\x20-\x27\x2A-\x5B\x5D-\x7E]|\\[\x09\x20-\x7E])+           comment_text_repeat
([\x21-\x5A\x5E-\x7E])+                                                  domain_text_repeat
([\x21\x23-\x5B\x5D-\x7E]|\\[\x09\x20-\x7E])+                            quoted_text_repeat
\x22                                                                     DQUOTE
[\x20\x09]*\x0D\x0A[\x20\x09]+                                           FWS
.                                                                        any other character

(lex definitions merged in more complex lex definitions)
[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]                                         NO_WS_CTL
[()<>[\]:;@\\,.]                                                         specials
[\x01-\x09\x0B\x0C\x0E-\x7F]                                             text
\\[\x09\x20-\x7E]                                                        quoted_pair ('\\' text)
[A-Za-z]                                                                 ALPHA
[0-9]                                                                    DIGIT
[\x20\x09]                                                               WSP
\x20                                                                     SP
\x09                                                                     HTAB
\x0D\x0A                                                                 CRLF
\x0D                                                                     CR
\x0A                                                                     LF

(yacc part)
address_list: address
            | address ',' address_list
address: mailbox
       | group
mailbox_list: mailbox
            | mailbox ',' mailbox_list
mailbox: name_addr
       | addr_spec
group: display_name ':' mailbox_list ';' CFWS
     | display_name ':' CFWS ';' CFWS
name_addr: angle_addr
         | display_name angle_addr
display_name: phrase
angle_addr: CFWS '<' addr_spec '>' CFWS
addr_spec: local_part '@' domain
local_part: dot_atom
          | quoted_string
domain: dot_atom
      | domain_literal
domain_literal: CFWS '[' FWS domain_text_repeat FWS ']' CFWS
phrase: word
      | word phrase
word: atom
    | quoted_string
atom: CFWS atom_text_repeat CFWS
dot_atom: CFWS dot_atom_text CFWS
dot_atom_text: atom_text_repeat
             | atom_text_repeat '.' dot_atom_text
quoted_string: CFWS DQUOTE quoted_text_repeat DQUOTE CFWS
CFWS: <empty>
    | FWS comment
    | CFWS comment FWS
comment: '(' comment_content ')'
comment_content: comment_text_repeat
               | comment
               | comment_content comment_content
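
As a starting point for a Python port, the lexer rules translate almost directly into the re module. Here is a rough sketch that covers only atom_text_repeat and dot_atom_text, with no CFWS, comments, or quoted strings. The names ATOM_TEXT, DOT_ATOM_TEXT and is_dot_atom_text are invented for this example and do not come from libtld.

```python
import re

# atom_text_repeat from the lex part, copied verbatim.
ATOM_TEXT = r"[-A-Za-z0-9!#$%&'*+/=?^_`{|}~]+"

# dot_atom_text: atom_text_repeat
#              | atom_text_repeat '.' dot_atom_text
# The right recursion of the yacc rule becomes a repeated group in a regex.
DOT_ATOM_TEXT = re.compile(ATOM_TEXT + r"(?:\." + ATOM_TEXT + r")*\Z")


def is_dot_atom_text(text):
    """True if text matches the dot_atom_text rule (without surrounding CFWS)."""
    return DOT_ATOM_TEXT.match(text) is not None


print(is_dot_atom_text("strange.example.com"))  # True
print(is_dot_atom_text("jsmith"))               # True
print(is_dot_atom_text("a..b"))                 # False: empty atom between the dots
print(is_dot_atom_text("very.(),:"))            # False: specials are not atom text
```

The rules that nest, such as comment and CFWS, cannot be expressed as a single regular expression like this and would need a small recursive parser on top of these token patterns.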
Alexis Wilke