5

I have a data frame (df) with emails and numbers like

    email                          euro
0   firstname@firstdomain.com      150
1   secondname@seconddomain.com     50
2   thirdname@thirddomain.com      300
3   kjfslkfj                         0
4   fourthname@fourthdomain.com    200

I need to filter all rows with correct emails and euro equal to or greater than 100 and another list with correct emails and euro lower than 100. I know that I can filter by euro like this

df_gt_100 = df.euro >= 100

and

df_lt_100 = df.euro < 100

But I can't find a way to filter the email addresses. I imported the email_validate package and tried things like this

validate_email(df.email)

which gives me a TypeError: expected string or bytes-like object.

Can anyone pls give me a hint how to approach this issue. It'd be nice if I could do this all in one filter with the AND and OR operators.

Thanks in advance, Manuel

Manuel
  • 143
  • 2
  • 9
  • Sorry, but all solve my problem. Which leads to a "new" issue, which is the best one. :-) Thanks a lot everybody I really appreciate your help. – Manuel May 03 '18 at 19:05

4 Answers4

10

Use apply, chain mask by & for AND and filter by boolean indexing:

from validate_email import validate_email

df1 = df[(df['euro'] > 100) & df['email'].apply(validate_email)]
print (df1)
                         email  euro
0    firstname@firstdomain.com   150
2    thirdname@thirddomain.com   300
4  fourthname@fourthdomain.com   200

Another approach with regex and contains:

df1 = df[(df['euro'] > 100) &df['email'].str.contains(r'[^@]+@[^@]+\.[^@]+')]
print (df1)
                         email  euro
0    firstname@firstdomain.com   150
2    thirdname@thirddomain.com   300
4  fourthname@fourthdomain.com   200
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you very much for your answer and solution. Would you recommend one solution over the other in terms of speed etc.? – Manuel May 03 '18 at 15:09
  • @Manuel - Hard question, because answers similars. Maybe the main advatage is simpliest first and second solution, but if it is faster the best check in real data. – jezrael May 03 '18 at 15:13
  • There's one more question coming up regarding this topic. How do I get the rows which don't match to a valid email or the right amount of euro? I don't want to loose them, I need them to be in another data frame. – Manuel May 04 '18 at 21:04
  • @jezrael - I keep getting this error when using: `df1 = df[df['email'].apply(validate_email)]` - TypeError: 'module' object is not callable - Any idea where this could come from. "email" is of dtype object – Manuel May 04 '18 at 21:23
5
In [30]: from validate_email import validate_email

In [31]: df
Out[31]: 
                       email
0  firstname@firstdomain.com
1                   kjfslkfj

In [32]: df['is_valid_email'] = df['email'].apply(lambda x:validate_email(x))

In [33]: df
Out[33]: 
                       email  is_valid_email
0  firstname@firstdomain.com            True
1                   kjfslkfj           False

In [34]: df['email'][df['is_valid_email']]
Out[34]: 
0    firstname@firstdomain.com
Amit Basuri
  • 563
  • 5
  • 12
  • Hi! When I try to execute the following error occurred! `EmailUndeliverableError("The domain name %s does not exist." % domain_i18n) EmailUndeliverableError: The domain name darakjy.org does not exist` – Indi Feb 21 '20 at 06:11
3

You can use regex expressions to find a match and then use apply on the email column to create a T/F column for where an email exists:

import re
import pandas as pd

pattern = re.compile(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")  # this is the regex expression to search on

df = pd.DataFrame({'email': ['firstname@domain.com', 'avicii@heaven.com', 'this.is.a.dot@email.com', 'email1234@112.com', 'notanemail'], 'euro': [123, 321, 150, 0, 133]})
df['isemail'] = df['email'].apply(lambda x: True if pattern.match(x) else False)

Result:

                     email  euro  isemail
0     firstname@domain.com   123     True
1        avicii@heaven.com   321     True
2  this.is.a.dot@email.com   150     True
3        email1234@112.com     0     True
4               notanemail   133    False

Now you can filter on the isemail column.

Scratch'N'Purr
  • 9,959
  • 2
  • 35
  • 51
  • I would favour a list comprehension over apply, and you can simplify your code to bool(pattern.match(...)). – cs95 May 03 '18 at 11:26
  • @cᴏʟᴅsᴘᴇᴇᴅ Thanks for the feedback! I never used the bool() approach before. Always learning something new on SO! – Scratch'N'Purr May 03 '18 at 11:39
  • Thank you very much for your answer and solution. I don't really understand what this lambda thing does, though. – Manuel May 03 '18 at 14:59
  • No problem! Lambda functions are just a lazy way to use anonymous functions without defining the function. I could have defined a function such as `def validate(x):` with the code block as `return True if pattern.match(x) else False` and passed it to apply `df['email'].apply(validate)`. – Scratch'N'Purr May 03 '18 at 15:11
  • Great, thanks a lot for taking your time and explaining this to me. – Manuel May 03 '18 at 19:07
1

validate_email returns a lot of information e.g. smtp etc., and for invalid emails, it throws an EmailNotValidError exception. You can write a function, and apply on pandas series -

from email_validator import validate_email, EmailNotValidError

def validate_e(x):
    try:
        v = validate_email(x)
        return True
    except EmailNotValidError as e:
        return False

df["Email_validate"] = df['email'].apply(validate_e)
Aritesh
  • 1,985
  • 1
  • 13
  • 17