Input (comma separated list):

"\"Mr ABC\" <mr@abc.com>, \"Foo, Bar\" <foo@bar.com>, mr@xyz.com"

Expected output (list of 2-tuples):

[("Mr ABC", "mr@abc.com"), ("Foo, Bar", "foo@bar.com"), ("", "mr@xyz.com")]

My first idea was to split on commas and call email.utils.parseaddr(address) on each piece, until I realized that the name part can itself contain a comma, as in "Foo, Bar" above.

email.utils.getaddresses(fieldvalues) is very close to what I need but it accepts a sequence, not a comma separated string.
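
To make the problem concrete, here is a minimal sketch of the naive approach described above. The variable names `raw` and `pieces` are illustrative only; the point is that splitting on every comma cuts the quoted name in two before parseaddr ever sees it.

```python
from email.utils import parseaddr

raw = '"Mr ABC" <mr@abc.com>, "Foo, Bar" <foo@bar.com>, mr@xyz.com'

# Naive approach: split on every comma, then parse each piece separately.
pieces = [piece.strip() for piece in raw.split(",")]

# The input holds 3 addresses, but the split produces 4 pieces because
# the comma inside the quoted name "Foo, Bar" is treated as a separator.
print(len(pieces))
print(pieces[1], "|", pieces[2])

# parseaddr itself is fine on a single well-formed address.
print(parseaddr(pieces[0]))
```

So the splitting step, not parseaddr, is what breaks on names that contain commas.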

Wiktor Stribiżew
Taranjeet Singh
  • You could split at `>, ` – tobias_k Jul 23 '15 at 10:09
  • There is a useful header parsing method https://stackoverflow.com/questions/33511371/how-do-you-extract-multiple-email-addresses-from-an-rfc-2822-mail-header-in-pyth – Serge May 31 '17 at 17:47

2 Answers

You may use the following

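import re

# Group 1: text inside double quotes; Group 2 (optional): address inside <...>
p = re.compile(r'"([^"]+)"(?:\s+<([^<>]+)>)?')
# Note: the bare address is quoted here; the pattern only matches quoted text
test_str = '"Mr ABC" <mr@abc.com>, "Foo, Bar" <foo@bar.com>, "mr@xyz.com"'
print(re.findall(p, test_str))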

Output: [('Mr ABC', 'mr@abc.com'), ('Foo, Bar', 'foo@bar.com'), ('mr@xyz.com', '')]

The regex matches...

  • " - a double quote
  • ([^"]+) - (Group 1) 1 or more characters other than a double quote
  • " - a double quote

Then, an optional non-capturing group is introduced with (?:...)? construct: (?:\s+<([^<>]+)>)?. It matches...

  • \s+ - 1 or more whitespace characters
  • < - an opening angle bracket
  • ([^<>]+) - (Group 2) 1 or more characters other than opening or closing angle brackets
  • > - a closing angle bracket
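
The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE. This is only a restatement of the same regex, not a different technique; the names `pattern` and `sample` are illustrative.

```python
import re

# Same regex as above, spread over several lines with inline comments.
pattern = re.compile(r'''
    "([^"]+)"        # Group 1: one or more non-quote characters in quotes
    (?:              # optional non-capturing group
        \s+          #   one or more whitespace characters
        <([^<>]+)>   #   Group 2: text between angle brackets
    )?               # the whole bracketed part may be absent
''', re.VERBOSE)

sample = '"Mr ABC" <mr@abc.com>, "Foo, Bar" <foo@bar.com>, "mr@xyz.com"'
print(pattern.findall(sample))
```

In verbose mode, whitespace outside character classes and anything after `#` is ignored, so the pattern behaves like the one-line version.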

The re.findall function gets all capture groups into a list of tuples:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
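
A small sketch of that rule, using a made-up string `text` rather than the address list: the return shape of re.findall depends on how many capture groups the pattern has, and an optional group that did not take part in a match shows up as an empty string.

```python
import re

text = 'a=1, b=2, c'

# No capture groups: a list of whole matches.
no_groups = re.findall(r'\w=\d', text)

# One capture group: a list of that group's text.
one_group = re.findall(r'(\w)=\d', text)

# Two capture groups: a list of tuples; the optional second group
# is '' for 'c', which has no '=digit' part.
two_groups = re.findall(r'(\w)(?:=(\d))?', text)

print(no_groups)
print(one_group)
print(two_groups)
```

This is why the address without a `<...>` part comes back with an empty second element.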

UPDATE:

To make sure the email is always the second element of the tuple, swap the two items whenever the second group is empty:

lst = re.findall(p, test_str)
print([(tpl[1], tpl[0]) if not tpl[1] else tpl for tpl in lst])
# => [('Mr ABC', 'mr@abc.com'), ('Foo, Bar', 'foo@bar.com'), ('', 'mr@xyz.com')]
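
The two steps can be wrapped into one helper. This is a minimal sketch; the function name `parse_quoted_addresses` is hypothetical, and it assumes, like the regex above, that every entry starts with a double-quoted part.

```python
import re

PATTERN = re.compile(r'"([^"]+)"(?:\s+<([^<>]+)>)?')

def parse_quoted_addresses(text):
    """Return (name, email) pairs, with the email always second."""
    pairs = []
    for first, second in PATTERN.findall(text):
        if second:
            # Both parts present: first is the name, second the email.
            pairs.append((first, second))
        else:
            # Only a quoted part: treat it as an email with no name.
            pairs.append(("", first))
    return pairs

test_str = '"Mr ABC" <mr@abc.com>, "Foo, Bar" <foo@bar.com>, "mr@xyz.com"'
print(parse_quoted_addresses(test_str))

# Limitation: an unquoted bare address is not matched at all.
print(parse_quoted_addresses('"A" <a@b.com>, c@d.com'))
```

Note the limitation in the last call: because the pattern requires a leading double quote, an unquoted address such as the `mr@xyz.com` in the question's input is silently skipped.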
Wiktor Stribiżew
  • That's a beautiful solution but then I forgot to mention that we could also have emails with just the email part like 'mr@xyz.com'. Updated the question. – Taranjeet Singh Jul 23 '15 at 10:23
  • I have updated the answer, please look. Note that the tuples will still be created even if the email part is absent, and the email will be captured into group 1 since they are not inside `<>`. – Wiktor Stribiżew Jul 23 '15 at 10:27
  • In case your output needs to be `[('Mr ABC', 'mr@abc.com'), ('Foo, Bar', 'foo@bar.com'), ('', 'mr@xyz.com')]`, you can use `lst = re.findall(p, test_str) // print([(tpl[1], tpl[0]) if not tpl[1] else tpl for tpl in lst])`. – Wiktor Stribiżew Jul 23 '15 at 10:38

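Use email.utils.getaddresses for that. It expects a sequence of header values, so wrap the string in a list:

from email.utils import getaddresses
emails = getaddresses(['"Mr ABC" <mr@abc.com>, "Foo, Bar" <foo@bar.com>, "mr@xyz.com"'])

=> [('Mr ABC', 'mr@abc.com'), ('Foo, Bar', 'foo@bar.com'), ('', 'mr@xyz.com')]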
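
As a hedged illustration of why getaddresses takes a sequence: a message can carry several address headers, and each element of the sequence is one header value that may itself hold a comma-separated list. The sketch below uses the question's original input (with the last address unquoted); the names `raw`, `to_header` and `cc_header` are made up for the example.

```python
from email.utils import getaddresses

# The question's input: one comma-separated string.
raw = '"Mr ABC" <mr@abc.com>, "Foo, Bar" <foo@bar.com>, mr@xyz.com'

# Wrap the single string in a one-element list.
pairs = getaddresses([raw])
print(pairs)

# Several header values (for example To and Cc) can be parsed together.
to_header = 'a@b.com'
cc_header = '"X, Y" <x@y.com>'
combined = getaddresses([to_header, cc_header])
print(combined)
```

Commas inside a quoted display name are kept as part of the name, which is the case a plain split on commas gets wrong.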
Cyril N.
  • This works great, except you need to pass a list to getaddresses: emails = getaddresses(['"Mr ABC" <mr@abc.com>, "Foo, Bar" <foo@bar.com>, "mr@xyz.com"']) --> [('Mr ABC', 'mr@abc.com'), ('Foo, Bar', 'foo@bar.com'), ('', 'mr@xyz.com')] – Emilia Apostolova Jan 02 '22 at 01:21