I asked a question before about extracting matching groups of names & emails from a long email body text into tuples using a regex. Solution worked beautifully extracting names and emails for example from this text:
> Begin forwarded message:
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Charlie Brown <aaa@aa-aaa.com>
> To: maria.brown@aaa.com, George Washington <george@washington.com>, =
thomas.jefferson@aaa.com, thomas.alva.edison@aaa.com, Juan =
<juan@aaa.com>, Alan <alan@aaa.com>, Alec <alec@aaa.com>, =
Alejandro <aaa@aaa.com>, Alex <aaa@planeas.com>, Andrea =
<andrea.mery@thomsen.cl>, Andrea <andrea.22@aaa.com>, Andres =
<andres@aaa.com>, Andres <avaldivieso@aaa.com>
> Hi,
> Please reply ASAP with your RSVP
> Bye
And using this Regex:
[:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+@[\w.-]+)
Producing this output:
[(Charlie Brown', 'aaa@aaa.com'),('','maria.brown@aaa.com'),('George Washington', 'george@washington.com'),('','thomas.jefferson@aaa.com'),('','thomas.alva.edison@aaa.com'),('Juan','juan@aaa.com',('Alan', 'alan@aaa.com'), ('Alec', 'alec@aaa.com'),('Alejandro','aaa@aaa.com'),('Alex', 'aaa@aaa.com'),('Andrea','andrea.mery@thomsen.cl'),('Andrea','andrea.22@aaa.com',('Andres','andres@aaa.com'),('Andres','avaldivieso@aaa.com')]
But, I stumbled upon instances where the names on the texts I passed to the regex have special accented characters. How can I update the regex above to not break and also capture names that include accented characters like:
"á", "é", "í", "ó", "ú", "ç", "ö", "ü", "ñ", "à", "è", "ì", "ò", "ù"
(and their UPPER counterparts)
Thanks!