0

I'm trying to scrape names from a chunk of text (from an email body actually) that normally looks similar to this:

From: aaa@aaa.com
CC: John Smith <aaa@aaa.com>, Charles <aaa@aaa.com>, Mary Lamb <aaa@aaa.com>, Chino <aaa@aaa.com>, Claudia <aaa@aaa.com>, <aaa@aaa.com>, <bbb@bbb.com>, John <aaa@aaa.com>
Hi there AAA! Hope you had a wonderful time
Best,
AAA

I would like to end up with a list that holds only the names (first and last, if available) of everyone on the CC line, discarding the rest of the information. What would be a simple and clean approach using regex? (This is not a test; it's a real app I'm working on and I'm stuck.) I was already able to extract all the email addresses using re.findall() with an email-matching pattern I found.

Thanks

newyuppie
  • Check out [this question](http://stackoverflow.com/questions/6209910/parse-small-string-for-name-and-email). – Dave Chen Oct 25 '14 at 04:35
  • Dave, while it's not exactly what I needed, that question did point me to some new things I am looking into right now. Thanks – newyuppie Oct 25 '14 at 17:33
  • @AvinashRaj true, but that's not really relevant since what I needed was to extract the name part regardless of what the name is – newyuppie Oct 25 '14 at 17:41

4 Answers

3

You can use this regex:

[:,] ([\w ]+) \<


>>> import re
>>> # text holds the email body shown in the question
>>> p = re.compile(r'[:,] ([\w ]+) <')
>>> m = p.findall(text)
>>> print(m)
['John Smith', 'Charles', 'Mary Lamb', 'Chino', 'Claudia', 'John']
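
As a rough sketch on top of this, and assuming the header always sits on its own line starting with `CC:`, one could isolate that line first so that a stray `, word <` in the message body is not picked up. The helper name `cc_names` and the sample body are made up for illustration.

```python
import re

# Hypothetical helper, not part of the original answer.
def cc_names(text):
    # Grab the first line that starts with "CC:" (assumed header layout).
    header = re.search(r'^CC:.*$', text, flags=re.MULTILINE)
    if header is None:
        return []
    # Same pattern as above, applied to the CC line only.
    return re.findall(r'[:,] ([\w ]+) <', header.group(0))

# Made-up sample: the body contains ", Bob <", which the bare pattern would match.
sample = """From: aaa@aaa.com
CC: John Smith <aaa@aaa.com>, Charles <aaa@aaa.com>, <bbb@bbb.com>, John <aaa@aaa.com>
Thanks, Bob <3
Best,
AAA"""

print(cc_names(sample))  # ['John Smith', 'Charles', 'John']
```

Running the bare pattern over the whole sample would also return 'Bob', which is why the CC line is extracted first here.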
Amal Murali
1

You could try the following.

>>> import re
>>> s = """From: aaa@aaa.com
... CC: John Smith <aaa@aaa.com>, Charles <aaa@aaa.com>, Mary Lamb <aaa@aaa.com>, Chino <aaa@aaa.com>, Claudia <aaa@aaa.com>, <aaa@aaa.com>, <bbb@bbb.com>, John <aaa@aaa.com>
... Hi there AAA! Hope you had a wonderful time
... Best,
... AAA"""
>>> re.findall(r'(?<=[:,]\s)[A-Z][a-z]+(?:\s[A-Z][a-z]+)?(?=\s<)', s)
['John Smith', 'Charles', 'Mary Lamb', 'Chino', 'Claudia', 'John']
Avinash Raj
0

Use the regex:

re.findall(r"(?:CC: |, )([\w ]*) <\S*@\S*>", text)
Andrew Luo
0

This will capture strictly what you need.

[:,]\s((?:(?![:,<]).)*)\s\<

Use group 1 to get the name.
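
A minimal sketch of how that might look in Python, assuming the email body is in a variable called `text` (the sample below is shortened from the question). The variable names are illustrative only.

```python
import re

text = """From: aaa@aaa.com
CC: John Smith <aaa@aaa.com>, Charles <aaa@aaa.com>, Mary Lamb <aaa@aaa.com>, <bbb@bbb.com>, John <aaa@aaa.com>
Hi there AAA! Hope you had a wonderful time
Best,
AAA"""

# The pattern from this answer; the "<" does not need escaping in Python.
pattern = re.compile(r'[:,]\s((?:(?![:,<]).)*)\s<')

# finditer yields match objects; group(1) is the captured name.
names = [m.group(1) for m in pattern.finditer(text)]
print(names)  # ['John Smith', 'Charles', 'Mary Lamb', 'John']
```

Since the pattern has exactly one capturing group, `pattern.findall(text)` should return the same list directly.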

depsai