How can I filter bad chars in a Pandas DataFrame?

Question

I have the code below. I want to filter out the bad characters in a mailing list, and I tried to do it with the bad_chars() function.

import pandas as pd
import numpy as np



excelRead = pd.read_excel('mailing.xlsx')
excelRead.dropna(inplace= True)
badCharsList = ["ü", "ı", "ö", "ç", "ş", "ğ", "!", "#", "$", "%", "&", "'", "*", "+", "/", "=", "?", "^", "`", "{", "|", "}", "~", "(", ")", ",", ":", ";", "<", ">", "[", " "]

def bad_chars(x):
    for i in badCharsList:
        if i.lower() not in x.lower():
            return i
        else:
            return np.nan

excelTest = excelRead[excelRead['mails'].str.endswith("@gmail.com", na=False) | excelRead['mails'].str.endswith("@hotmail.com", na=False) | excelRead['mails'].str.endswith("@outlook.com", na=False) | excelRead['mails'].str.endswith("@icloud.com", na=False) | excelRead['mails'].str.endswith("@windowslive.com", na=False) | excelRead['mails'].str.endswith("@yandex.com", na=False) | excelRead['mails'].str.endswith("@mynet.com", na=False) | excelRead['mails'].str.endswith("@hotmail.com.tr", na=False) | excelRead['mails'].str.endswith("@yahoo.com", na=False)]

lower = excelTest['mails'].str.lower()
testBad = excelTest['mails'].apply(bad_chars)

print(testBad)

But this is the output I got. Where do you think I went wrong or how can I achieve this?

Output

0         ü
1         ü
2         ü
3         ü
4         ü
         ..
107808    ü
107809    ü
107810    ü
107811    ü
107812    ü
Name: mails, Length: 104507, dtype: object

Sample of the data before applying the bad_chars() function:

0            okanmercannn@hotmail.com
1         06hvm42hotmailcom@gmail.com
2              adanasenol01@gmail.com
3          sezersenturk6305@gmail.com
4                alyasu1903@gmail.com
                     ...
107808        elifyucel2566@gmail.com
107809        yayla19871987@gmail.com
107810         zeynepyilkus@gmail.com
107811    pathoss_theodra@hotmail.com
107812           ziver.7340@gmail.com

Please give a sample of your main dataframe and expected output. — Nuri Taş, Sep 07 '22 at 11:19

Josh Friedlander · Answer 1 · 2022-09-07T11:52:20.207

You can use str.contains with a regular expression that joins all the bad characters with | (alternation), so it matches if any one of them appears. The characters that have special meaning in regular expressions must be escaped, which re.escape does. (I based this on this answer.)

import re

email_domains = ["gmail.com","hotmail.com","outlook.com","icloud.com",
                 "windowslive.com","yandex.com", "mynet.com","hotmail.com.tr",
                 "yahoo.com"]

# check if domain is in domain list  
excelTest = excelRead[excelRead['mails'].str.split("@").str[-1].isin(email_domains)]

# look for bad chars with a regex
regex = '|'.join(re.escape(s.lower()) for s in badCharsList)
excelTest = excelTest[~excelTest['mails'].str.lower().str.contains(regex)]
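
To try the two filters without the Excel file, here is a minimal self-contained sketch. The sample addresses and the shortened lists are made up for illustration; only the column name mails and the variable names come from the code above. Passing na=False is an extra precaution I added so that missing values count as "no match".

```python
import re
import pandas as pd

# Made-up sample data (assumption: the real file has a 'mails' column like this)
excelRead = pd.DataFrame({"mails": [
    "good.user@gmail.com",
    "bad!user@gmail.com",
    "Şule@hotmail.com",
    "someone@example.org",
    "two words@yahoo.com",
]})

# Shortened lists for the demo (hypothetical subsets of the originals)
email_domains = ["gmail.com", "hotmail.com", "yahoo.com"]
badCharsList = ["ş", "!", "?", "^", "(", ")", " "]

# Keep rows whose domain (the text after the last '@') is in the allowed list
excelTest = excelRead[excelRead["mails"].str.split("@").str[-1].isin(email_domains)]

# Build one alternation pattern; re.escape makes '?', '^', '(' etc. literal
regex = "|".join(re.escape(s.lower()) for s in badCharsList)

# Flag rows containing at least one bad character, then keep the rest
has_bad = excelTest["mails"].str.lower().str.contains(regex, na=False)
clean = excelTest[~has_bad]

print(clean["mails"].tolist())  # ['good.user@gmail.com']
```

Without re.escape, characters such as ? or ( would be read as regex syntax, and the pattern would either fail to compile or match the wrong thing.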

edited Sep 07 '22 at 11:52

answered Sep 07 '22 at 11:27

Josh Friedlander


It's not working – Ozans Sep 07 '22 at 11:47
Sorry, was missing a negation. Try now – Josh Friedlander Sep 07 '22 at 11:54
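
On the "where did I go wrong" part of the question: in bad_chars() both branches of the if return, so the loop never gets past the first element of the list. "ü" does not occur in any of the sample addresses, so the "not in" test succeeds immediately and every row gets "ü". One possible fix, sketched below with a shortened character list and a hypothetical helper has_bad_char() (not from the original post), is to decide only after every character has been checked:

```python
# Shortened list for illustration; the original badCharsList is longer
badCharsList = ["ü", "ı", "ö", "!", "?", "^", " "]

def bad_chars(x):
    """Original logic: both branches return, so only 'ü' is ever tested."""
    for i in badCharsList:
        if i.lower() not in x.lower():
            return i
        else:
            return None  # the post returns np.nan here

def has_bad_char(x):
    """Hypothetical replacement: True if any bad character occurs in x."""
    lowered = x.lower()
    return any(ch in lowered for ch in badCharsList)

print(bad_chars("ziver.7340@gmail.com"))     # ü (first list element, not a real match)
print(has_bad_char("ziver.7340@gmail.com"))  # False
print(has_bad_char("ziver!7340@gmail.com"))  # True
```

With pandas this could be applied as excelTest[~excelTest['mails'].apply(has_bad_char)], although a vectorized str.contains call is usually faster on a column of this size.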
