Python3

I need help creating a regex to extract names and emails from a forwarded email body, which will always look similar to this (real emails replaced by dummy emails):

> Begin forwarded message:
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Charlie Brown <aaa@aa-aaa.com>
> To: maria.brown@aaa.com, George Washington <george@washington.com>, =
thomas.jefferson@aaa.com, thomas.alva.edison@aaa.com, Juan =
<juan@aaa.com>, Alan <alan@aaa.com>, Alec <alec@aaa.com>, =
Alejandro <aaa@aaa.com>, Alex <aaa@planeas.com>, Andrea =
<andrea.mery@thomsen.cl>, Andrea <andrea.22@aaa.com>, Andres =
<andres@aaa.com>, Andres <avaldivieso@aaa.com>
> Hi,
> Please reply ASAP with your RSVP
> Bye

My first step was extracting all emails to a list with a custom function that I pass the whole email body to, like so:

import re

def extract_emails(block_of_text):
    # Match anything shaped like local-part@domain.tld
    t = r'\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b'
    return re.findall(t, block_of_text)

A couple of days ago I asked a question about extracting names using regex to help me build the function to extract all the names. My idea was to join both later on. I accepted an answer that performed what I asked, and came up with this other function:

def extract_names(block_of_text):
    # Capture the words between a ':' or ',' separator and the opening '<'
    p = r'[:,] ([\w ]+) \<'
    return re.findall(p, block_of_text)

My problem now is making the extracted names line up with the extracted emails, mainly because there are sometimes fewer names than emails. So I thought it would be better to build a single regex that extracts both names and emails.
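A minimal sketch of that mismatch, reusing the two functions above on a shortened, made-up `To:` line (the sample string and the variable names are illustrative only):

```python
import re

def extract_emails(block_of_text):
    t = r'\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b'
    return re.findall(t, block_of_text)

def extract_names(block_of_text):
    p = r'[:,] ([\w ]+) \<'
    return re.findall(p, block_of_text)

# Hypothetical one-line sample: the first address has no display name.
sample = "To: maria.brown@aaa.com, George Washington <george@washington.com>, Alan <alan@aaa.com>"

emails = extract_emails(sample)   # three addresses
names = extract_names(sample)     # only two names

# Pairing by position shifts every name onto the wrong address.
pairs = list(zip(names, emails))
print(pairs)
```

Because the nameless address comes first, 'George Washington' ends up paired with maria.brown@aaa.com, which is why the two lists cannot simply be zipped together.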

This is my failed attempt to build such a regex.

[:,]([\w \<]+)([\w.-]+@[\w.-]+\.[\w.-]+)

REGEX101 LINK

Can anyone help and propose a nice, clean regex that grabs both name and email, to a list or dictionary of tuples? Thanks

EDIT: The expected output of the regex in Python would be a list like this:

    [('Charlie Brown', 'aaa@aa-aaa.com'), ('', 'maria.brown@aaa.com'), ('George Washington', 'george@washington.com'), ('', 'thomas.jefferson@aaa.com'), ('', 'thomas.alva.edison@aaa.com'), ('Juan', 'juan@aaa.com'), ('Alan', 'alan@aaa.com'), ('Alec', 'alec@aaa.com'), ('Alejandro', 'aaa@aaa.com'), ('Alex', 'aaa@planeas.com'), ('Andrea', 'andrea.mery@thomsen.cl'), ('Andrea', 'andrea.22@aaa.com'), ('Andres', 'andres@aaa.com'), ('Andres', 'avaldivieso@aaa.com')]
newyuppie
  • http://regex101.com/#python – Joel Cornett Oct 26 '14 at 03:01
  • here's a rough sketch: `(\b\w+\b(?:\s+\b\w+\b)*)\s*<([\w.-]+@[\w.-]+)>|([\w.-]+@[\w.-]+)` – Joel Cornett Oct 26 '14 at 03:04
  • Are you only wanting to match-up the names that also have an email associated with them: i.e. `Alex <aaa@planeas.com>`, but not `thomas.jefferson@aaa.com`? – l'L'l Oct 26 '14 at 03:25
  • no, if there is no name it should still be stored as a 'blank', the important part are the emails, and the names only for those emails that have them – newyuppie Oct 26 '14 at 03:31
  • @JoelCornett your sketch sort of works, I'm looking into it – newyuppie Oct 26 '14 at 03:32
  • @JoelCornett the only issue with your proposed Regex is that even though it matches correctly, it creates 3 capture groups instead of 2 (the third group is capturing emails that don't have an associated name)... Do you think that can be improved upon, so it only produces 2 groups? It looks like it's in the right direction though – newyuppie Oct 26 '14 at 03:46
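One way to live with the three groups from the sketch in the comments is to collapse them after matching, since at most one of the two email groups is non-empty per match. The following is only an illustration under that assumption; the sample string and the `pairs` name are made up:

```python
import re

# The rough sketch from the comments, unchanged: group 1 is the name,
# group 2 the bracketed address, group 3 a bare address.
sketch = r'(\b\w+\b(?:\s+\b\w+\b)*)\s*<([\w.-]+@[\w.-]+)>|([\w.-]+@[\w.-]+)'

# Hypothetical one-line sample.
sample = "To: maria.brown@aaa.com, George Washington <george@washington.com>, Alan <alan@aaa.com>"

# findall returns '' for a group that did not take part in the match,
# so `bracketed or bare` picks whichever address was actually captured.
pairs = [(name, bracketed or bare)
         for name, bracketed, bare in re.findall(sketch, sample)]
print(pairs)
```

This keeps the regex itself untouched and yields two-element tuples. Note that the sketch does not account for the `=` line continuations in the forwarded body, so a name separated from its address by `=` would come out blank.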

1 Answer

Seems like you want something like this:

[:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+@[\w.-]+)

DEMO

>>> import re
>>> s = """ > Begin forwarded message:
>=20
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Charlie Brown <aaa@aa-aaa.com>
> To: maria.brown@aaa.com, George Washington <george@washington.com>, =
thomas.jefferson@aaa.com, thomas.alva.edison@aaa.com, Juan =
<juan@aaa.com>, Alan <alan@aaa.com>, Alec <alec@aaa.com>, =
Alejandro <aaa@aaa.com>, Alex <aaa@planeas.com>, Andrea =
<andrea.mery@thomsen.cl>, Andrea <andrea.22@aaa.com>, Andres =
<andres@aaa.com>, Andres <avaldivieso@aaa.com>
> Hi,
> Please reply ASAP with your RSVP
> Bye"""
>>> re.findall(r'[:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+@[\w.-]+)', s)
[('Charlie Brown', 'aaa@aa-aaa.com'), ('', 'maria.brown@aaa.com'), ('George Washington', 'george@washington.com'), ('', 'thomas.jefferson@aaa.com'), ('', 'thomas.alva.edison@aaa.com'), ('Juan', 'juan@aaa.com'), ('Alan', 'alan@aaa.com'), ('Alec', 'alec@aaa.com'), ('Alejandro', 'aaa@aaa.com'), ('Alex', 'aaa@planeas.com'), ('Andrea', 'andrea.mery@thomsen.cl'), ('Andrea', 'andrea.22@aaa.com'), ('Andres', 'andres@aaa.com'), ('Andres', 'avaldivieso@aaa.com')]
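For readability, the same pattern can be spelled out with `re.VERBOSE`, which ignores whitespace in the pattern and allows comments. This is just an annotated restatement of the regex above; the `PAIR_RE` name and the shortened sample are illustrative:

```python
import re

PAIR_RE = re.compile(r"""
    [:,]                        # a header colon or a list comma starts each entry
    \s* =? \s*                  # optional whitespace and a '=' soft line break
    (?:                         # the display name is optional
        (                       # group 1: the name
            [A-Z][a-z]+         #   one capitalised word
            (?:\s[A-Z][a-z]+)?  #   optionally followed by a second one
        )
    )?
    \s* =? \s*                  # the soft line break may also follow the name
    .*?                         # lazily skip whatever is left, such as the bracket
    (                           # group 2: the address
        [\w.]+ @ [\w.-]+
    )
""", re.VERBOSE)

# Hypothetical shortened sample with one soft line break.
sample = "From: Charlie Brown <aaa@aa-aaa.com>\nTo: maria.brown@aaa.com, Juan =\n<juan@aaa.com>"
result = PAIR_RE.findall(sample)
print(result)
```

Since the name group requires a capital followed by lowercase ASCII letters, accented names and capitalised local parts such as `Maria.Brown@aaa.com` fall outside what this pattern was written for.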
Avinash Raj
  • Pretty close, but I would actually need to have tuples of the email and corresponding name (a bit similar to the output produced by @JoelCornett's regex above on the comments) – newyuppie Oct 26 '14 at 03:54
  • What I mean, and I might not be using the correct terminology by "tuple", is that the output should look more like [('George Washington', 'george@washington.com'),('Alan', 'alan@aaa.com'), ('','thomas.jefferson@aaa.com'), etc...] – newyuppie Oct 26 '14 at 03:58
  • could you post the expected output in your question. What would be the expected tuples for the names and emails present in the From , To lines? – Avinash Raj Oct 26 '14 at 04:00
  • did you want to capture `maria.brown@aaa.com` ?, provide the full expected output.. – Avinash Raj Oct 26 '14 at 04:06
  • Avinash, I have edited to provide full expected output, thanks for making me clarify. Hopefully you can help me. Thanks – newyuppie Oct 26 '14 at 04:51
  • Excellent, it works perfectly for me, thank you for the fast and efficient response :) – newyuppie Oct 26 '14 at 22:10
  • Avinash, how could I update the regex so that it includes matching also the accented characters on the name part? (For example, it doesn't match a name like "Andrés"). Thanks for any tips! – newyuppie Nov 01 '14 at 12:47
  • could you ask it as a new question with a link to this? – Avinash Raj Nov 01 '14 at 12:50
  • http://stackoverflow.com/questions/26689565/capture-accented-characters-by-modifying-a-specific-regex-i-have-in-python3 – newyuppie Nov 01 '14 at 13:02
  • Avinash, I faced some situations where the email part was written in Pascal case. In those cases, the proposed regex does not correctly parse the groups. For example, just by changing "maria.brown@aaa.com" to "Maria.Brown@aaa.com" produces incorrect grouping. How to modify the regex to avoid this situation? Thanks – newyuppie Nov 02 '14 at 03:22
  • try this http://regex101.com/r/dN3kB4/14. Seems like you get all the help only from me to get the job done. How much did you pay for me? lol :-) – Avinash Raj Nov 02 '14 at 03:26
  • works! :) apparently you are the only expert regex-er over here! (it's a pay-it-forward, you got a lot of karma for helping me out ;) thanks a lot – newyuppie Nov 02 '14 at 03:39
  • Avinash, I need more help, on many of the strings I pass to the regex there are some strange artifacts... willing to pay :) give me instructions – newyuppie Nov 07 '14 at 03:23
  • just for fun. I didn't have even a paypal account :-) Just ask me, i'll help you if i could. – Avinash Raj Nov 07 '14 at 03:25
  • :) well, there are some quirks here and there depending on the text passed, for example, with some real emails I had to parse with this regex: - http://regex101.com/r/mZ7eF2/1 -> here, it does not pick up any of the names mentioned ("Antonella Sassi" for match 1, and "Pablo Ambram" for match 2). - http://regex101.com/r/zW1oG0/1 -> here, match 2 incorrectly picks up as a name the day of the week ("Thu"), and match 6 and 8 are not picked up entirely - http://regex101.com/r/lC0rN4/1 -> not focusing on the unicode thing, match 7 for example doesnt pick up the whole name, match 13 not picked up – newyuppie Nov 07 '14 at 15:29
  • would you prefer I open a new question for these adjustments? – newyuppie Nov 07 '14 at 15:34
  • i'm busy now. If you need an answer immediately then open a new question. – Avinash Raj Nov 07 '14 at 15:36
  • not urgent, I know we usually do this late nights so... no problem. I'll open up one anyway so you can get more points for helping me out. – newyuppie Nov 07 '14 at 15:37
  • Avinash, here's a link to the new question, when you have time later: http://stackoverflow.com/questions/26804989/python3-extracting-names-and-emails-as-groups-from-text-dump-using-only-regex-p thanks! – newyuppie Nov 07 '14 at 15:51
  • @newyuppie instead of asking new questions for the new updates, why don't you provide a large example which contains all the possibilities? Post it in the regex101 site and provide the link here. – Avinash Raj Nov 07 '14 at 16:03
  • Well, to tell you the truth, that's how I thought the site operated. Since you had asked me to open a new question before for an update on Unicode, I thought it was how it was done :/ I'll get to it in a while, thanks – newyuppie Nov 07 '14 at 17:45
  • yep, i asked because you stop asking after that. But the list goes long. So post all the possibilities in a single demo link. So that we could create a regex which matches all the possibilities. Sorry if the above comment hurts you. – Avinash Raj Nov 07 '14 at 17:48
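A possible direction for the two follow-up cases raised in the comments (accented names and capitalised addresses), offered as a sketch only and not the regex behind the regex101 links above: accept a name only when it is directly followed by `<`, and build it from `[^\W\d_]`, which matches any Unicode letter for `str` patterns in Python 3. The `LOOSE_RE` name and the sample are assumptions for illustration.

```python
import re

LOOSE_RE = re.compile(
    r'[:,]\s*=?\s*'                                # separator, optional soft line break
    r'(?:([^\W\d_]+(?:\s[^\W\d_]+)*)\s*=?\s*<)?'   # name, only if a '<' follows it
    r'([\w.-]+@[\w.-]+)'                           # the address itself
)

# Hypothetical sample: a capitalised bare address and an accented name.
sample = "To: Maria.Brown@aaa.com, Andrés =\n<andres@aaa.com>, Alan <alan@aaa.com>"
result = LOOSE_RE.findall(sample)
print(result)
```

Tying the name to the opening bracket means `Maria` in `Maria.Brown@aaa.com` is no longer mistaken for a display name. It also means a bare address must follow its separator directly, so other real-world layouts would still need to be collected into a single demo, as suggested above, before settling on a final pattern.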