Python extract email address from a HUGE string

Question

I have been using this: (I know, there are probably more efficient ways...)

Given this in an email message:

Submitted data:
First Name: MyName
Your Email Address: email@domain.com
TAG:

I coded this:

intStart = (bodystring.rfind('First ')) + 12
intEnd = (bodystring.rfind('Your Email'))
receiver_name = bodystring[intStart:intEnd]

intStart = (bodystring.rfind('Your Email Address: ')) + 20
intEnd = (bodystring.rfind('TAG:'))
receiver_email = bodystring[intStart:intEnd]

... and got what I needed. This worked because I had the 'TAG' label.

Now I am given this:

Submitted data:
First name: MyName
Last name:
Email: email@domain.com

I'm having a brain block on getting the email address without a next word. There is whitespace. Can someone nudge me in the right direction? I suspect I can dig out the email address after the occurrence of 'Email:' using regex...

If your input is always in this structure, you can split on spaces and grab the last item. `bodystring.split(' ')[-1]` — C_Z_, May 06 '21 at 17:20
"There is whitespace" - wait, does that mean you have another whitespace _after_ the email address, but _not_ another word ? — TheEagle, May 06 '21 at 17:21
Is the formatting correct? Is the data actually split into lines like that? — Mad Physicist, May 06 '21 at 17:26

score 2 · Answer 1 · answered May 06 '21 at 17:24

You can indeed use a regular expression to extract e-mail addresses.

To find a single address in a text, use re.search() and call .group() on the resulting match.
To find every address in the text, use re.findall().

An example:

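    import re

    text = "First name: MyName Last name: Email: email@domain.com "
    # Simplified pattern; it only matches lowercase addresses unless re.IGNORECASE is passed.
    pattern = r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+"

    # re.search() returns the first match, or None if there is none
    email = re.search(pattern, text)
    print(email.group())

    # re.findall() returns a list of every match
    emails = re.findall(pattern, text)
    print(emails)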

This prints:

email@domain.com
['email@domain.com']
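
Note that re.search() returns None when nothing matches, so calling .group() on the result directly would raise an AttributeError, and the character classes above only cover lowercase letters. A minimal sketch that guards against both, assuming a hypothetical helper name `find_first_email` and the same simplified pattern (it is not a full address validator):

```python
import re

# Same simplified pattern as above; re.IGNORECASE lets it match uppercase letters too.
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.IGNORECASE)

def find_first_email(text):
    """Return the first email-like substring in text, or None if there is none."""
    match = EMAIL_RE.search(text)
    return match.group() if match else None

print(find_first_email("First name: MyName Last name: Email: Email@Domain.com "))
print(find_first_email("no address here"))
```

Compiling the pattern once is optional; it mainly keeps the pattern and its flags in one place if the helper is called many times.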

Mad Physicist · Answer 2 · 2021-05-06T19:18:14.627

Searching structured text like this is often better done by splitting, and only occasionally with regular expressions. So first split the text into lines:

bodylines = bodystring.splitlines()

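Split each line at the first : delimiter (as a generator). Using partition rather than split(':') guarantees exactly one key and one value per line, even for blank lines or values that contain another colon, so the unpacking in the next step cannot fail:

chunks = (line.partition(':')[::2] for line in bodylines)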

Now grab the first one that has "email" on the left and @ on the right:

address = next(val.strip() for key, val in chunks if 'email' in key.lower() and '@' in val)

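If you want all the emails across multiple lines, use a list comprehension instead of next. Since chunks is a generator and can only be consumed once, recreate it first if you have already called next on it: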

addresses = [val.strip() for key, val in chunks if 'email' in key.lower() and '@' in val]

This can be done in one line with no imports (if you replace chunks with its definition, not that I recommend it). Regular expressions are a much heavier tool that lets you specify much more general patterns, but they can also be slower as a result. If you can get away with simple and effective tools, do it: don't bring in the sledgehammer until you need it!
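
The steps above can be combined into one function. A minimal sketch, assuming a hypothetical helper name `extract_emails`; it uses str.partition so that lines with no colon, or with more than one, still unpack into a key and a value:

```python
def extract_emails(bodystring):
    """Collect values from 'label: value' lines whose label mentions email."""
    addresses = []
    for line in bodystring.splitlines():
        # partition always returns three parts, so lines without a colon
        # (for example blank lines) do not break the unpacking.
        key, _, val = line.partition(':')
        if 'email' in key.lower() and '@' in val:
            addresses.append(val.strip())
    return addresses

body = "Submitted data:\nFirst name: MyName\nLast name:\n\nEmail: email@domain.com\n"
print(extract_emails(body))
```

Because the function iterates over the lines directly, it can be called repeatedly without worrying about an exhausted generator.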

score 1 · Answer 3 · answered May 06 '21 at 19:12

If the email should come after the word Email followed by a :, you can match the label part and capture the address in a group with an email-like pattern.

\bEmail[^:]*:\s*([^\s@]+@[^\s@]+)

\bEmail A word boundary to prevent a partial match, then match Email
[^:]*:\s* Match optional chars other than :, then match : and optional whitespace chars
( Capture group 1
- [^\s@]+@[^\s@]+ Match a single @ between one or more non-whitespace chars, excluding the @ itself
) Close group 1


Example with re.findall that returns the values of the capture groups:

import re
 
regex = r"\bEmail[^:]*:\s*([^\s@]+@[^\s@]+)"
 
s = ("Submitted data:\n"
    "First Name: MyName\n"
    "Your Email Address: email@domain.com\n"
    "TAG:\n\n"
    "Submitted data:\n"
    "First name: MyName\n"
    "Last name:\n"
    "Email: email@domain.com")
 
print(re.findall(regex, s))

Output

['email@domain.com', 'email@domain.com']
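
When only the first address is needed, the same pattern can be used with re.search and its capture group. A minimal sketch, assuming a hypothetical helper name `email_after_label`; the re.IGNORECASE flag is an addition so that labels such as "email:" are also accepted:

```python
import re

# Pattern from above; re.IGNORECASE also accepts labels such as "email:" or "EMAIL:".
LABELLED_EMAIL_RE = re.compile(r"\bEmail[^:]*:\s*([^\s@]+@[^\s@]+)", re.IGNORECASE)

def email_after_label(text):
    """Return the address following the first Email label, or None if absent."""
    match = LABELLED_EMAIL_RE.search(text)
    return match.group(1) if match else None

print(email_after_label("First name: MyName\nLast name:\nEmail: email@domain.com"))
print(email_after_label("First name: MyName\nLast name:\nEmail:"))
```

The second call returns None because the label is present but no address follows it.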
