how do I best validate email in pandas data frame

Question

I have a data frame (df) with emails and numbers like

    email                          euro
0   firstname@firstdomain.com      150
1   secondname@seconddomain.com     50
2   thirdname@thirddomain.com      300
3   kjfslkfj                         0
4   fourthname@fourthdomain.com    200

I need to filter all rows with correct emails and euro equal to or greater than 100 and another list with correct emails and euro lower than 100. I know that I can filter by euro like this

df_gt_100 = df.euro >= 100

and

df_lt_100 = df.euro < 100

But I can't find a way to filter the email addresses. I imported the email_validate package and tried things like this

validate_email(df.email)

which gives me a TypeError: expected string or bytes-like object.

Can anyone pls give me a hint how to approach this issue. It'd be nice if I could do this all in one filter with the AND and OR operators.

Thanks in advance, Manuel

Sorry, but all solve my problem. Which leads to a "new" issue, which is the best one. :-) Thanks a lot everybody I really appreciate your help. — Manuel, May 03 '18 at 19:05

jezrael · Answer 1 · 2018-05-03T07:31:33.867

10

Use apply, chain mask by & for AND and filter by boolean indexing:

from validate_email import validate_email

df1 = df[(df['euro'] > 100) & df['email'].apply(validate_email)]
print (df1)
                         email  euro
0    firstname@firstdomain.com   150
2    thirdname@thirddomain.com   300
4  fourthname@fourthdomain.com   200

Another approach with regex and contains:

df1 = df[(df['euro'] > 100) &df['email'].str.contains(r'[^@]+@[^@]+\.[^@]+')]
print (df1)
                         email  euro
0    firstname@firstdomain.com   150
2    thirdname@thirddomain.com   300
4  fourthname@fourthdomain.com   200

edited May 03 '18 at 07:31

answered May 03 '18 at 07:24

jezrael

822,522
95
1,334
1,252

Thank you very much for your answer and solution. Would you recommend one solution over the other in terms of speed etc.? – Manuel May 03 '18 at 15:09
@Manuel - Hard question, because answers similars. Maybe the main advatage is simpliest first and second solution, but if it is faster the best check in real data. – jezrael May 03 '18 at 15:13
There's one more question coming up regarding this topic. How do I get the rows which don't match to a valid email or the right amount of euro? I don't want to loose them, I need them to be in another data frame. – Manuel May 04 '18 at 21:04
@jezrael - I keep getting this error when using: `df1 = df[df['email'].apply(validate_email)]` - TypeError: 'module' object is not callable - Any idea where this could come from. "email" is of dtype object – Manuel May 04 '18 at 21:23

score 5 · Accepted Answer · answered May 03 '18 at 07:23

5

In [30]: from validate_email import validate_email

In [31]: df
Out[31]: 
                       email
0  firstname@firstdomain.com
1                   kjfslkfj

In [32]: df['is_valid_email'] = df['email'].apply(lambda x:validate_email(x))

In [33]: df
Out[33]: 
                       email  is_valid_email
0  firstname@firstdomain.com            True
1                   kjfslkfj           False

In [34]: df['email'][df['is_valid_email']]
Out[34]: 
0    firstname@firstdomain.com

answered May 03 '18 at 07:23

Amit Basuri

563
5
12

Hi! When I try to execute the following error occurred! `EmailUndeliverableError("The domain name %s does not exist." % domain_i18n) EmailUndeliverableError: The domain name darakjy.org does not exist` – Indi Feb 21 '20 at 06:11

score 3 · Answer 3 · answered May 03 '18 at 07:09

3

You can use regex expressions to find a match and then use apply on the email column to create a T/F column for where an email exists:

import re
import pandas as pd

pattern = re.compile(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")  # this is the regex expression to search on

df = pd.DataFrame({'email': ['firstname@domain.com', 'avicii@heaven.com', 'this.is.a.dot@email.com', 'email1234@112.com', 'notanemail'], 'euro': [123, 321, 150, 0, 133]})
df['isemail'] = df['email'].apply(lambda x: True if pattern.match(x) else False)

Result:

                     email  euro  isemail
0     firstname@domain.com   123     True
1        avicii@heaven.com   321     True
2  this.is.a.dot@email.com   150     True
3        email1234@112.com     0     True
4               notanemail   133    False

Now you can filter on the isemail column.

answered May 03 '18 at 07:09

Scratch'N'Purr

9,959
2
35
51

I would favour a list comprehension over apply, and you can simplify your code to bool(pattern.match(...)). – cs95 May 03 '18 at 11:26
@cᴏʟᴅsᴘᴇᴇᴅ Thanks for the feedback! I never used the bool() approach before. Always learning something new on SO! – Scratch'N'Purr May 03 '18 at 11:39
Thank you very much for your answer and solution. I don't really understand what this lambda thing does, though. – Manuel May 03 '18 at 14:59
No problem! Lambda functions are just a lazy way to use anonymous functions without defining the function. I could have defined a function such as `def validate(x):` with the code block as `return True if pattern.match(x) else False` and passed it to apply `df['email'].apply(validate)`. – Scratch'N'Purr May 03 '18 at 15:11
Great, thanks a lot for taking your time and explaining this to me. – Manuel May 03 '18 at 19:07

score 1 · Answer 4 · answered May 03 '18 at 07:04

1

validate_email returns a lot of information e.g. smtp etc., and for invalid emails, it throws an EmailNotValidError exception. You can write a function, and apply on pandas series -

from email_validator import validate_email, EmailNotValidError

def validate_e(x):
    try:
        v = validate_email(x)
        return True
    except EmailNotValidError as e:
        return False

df["Email_validate"] = df['email'].apply(validate_e)

answered May 03 '18 at 07:04

Aritesh

1,985
1
13
17

... and then `df_fltrd = df[df['Email_validate'] & (df['euro'] > 100)]` – SpghttCd May 03 '18 at 07:10
Hi! What is the purpose of assigning validate_email(x) to v? – Indi Feb 21 '20 at 06:34

how do I best validate email in pandas data frame

4 Answers4

Linked

Related