How to find special characters from Python Data frame

Question

I need to find special characters across an entire DataFrame.

In my DataFrame, some columns contain special characters. How can I find which columns contain them?

I want to display the text of each column that contains special characters.

@Sqoshu OK, it would be great if you could provide a code example. — Learnings, Jul 11 '18 at 14:25
Please take time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. — Zero, Jul 11 '18 at 14:35

score 7 · Accepted Answer · answered Jul 11 '18 at 14:41

You can set up an alphabet of valid characters, for example:

import string
alphabet = string.ascii_letters+string.punctuation

Which is

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

And just use

df.col.str.strip(alphabet).astype(bool).any()

For example,

import pandas as pd
df = pd.DataFrame({'col1':['abc', 'hello?'], 'col2': ['ÃÉG', 'Ç']})


    col1    col2
0   abc     ÃÉG
1   hello?  Ç

Then, with the above alphabet,

df.col1.str.strip(alphabet).astype(bool).any()
False
df.col2.str.strip(alphabet).astype(bool).any()
True

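The term "special characters" can be tricky, because it depends on your interpretation. For example, you might or might not consider # to be a special character. Likewise, some languages (such as Portuguese) use characters like ã and é, while others (such as English) do not. Note that the alphabet above contains no digits or whitespace, so a value such as 'abc 123' would also be flagged; add string.digits and string.whitespace if those should count as valid.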
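To apply the same check to the whole DataFrame rather than one column at a time, something like the following sketch might work. It assumes every column can be treated as text; the helper name `has_special` and the variables `mask`, `flagged` and `special_columns` are illustrative, not part of the answer above.

```python
import string

import pandas as pd

# Assumption: letters and punctuation are "normal", as in the answer above.
alphabet = string.ascii_letters + string.punctuation

df = pd.DataFrame({'col1': ['abc', 'hello?'], 'col2': ['ÃÉG', 'Ç']})


def has_special(series, allowed=alphabet):
    """Flag values that contain at least one character outside `allowed`."""
    # strip() removes allowed characters from both ends and stops at the first
    # character it does not know, so a non-empty remainder means a hit.
    return series.astype(str).str.strip(allowed).str.len() > 0


mask = df.apply(has_special)          # boolean DataFrame, same shape as df
flagged = mask.any()                  # one boolean per column
special_columns = flagged[flagged].index.tolist()
print(special_columns)

# Show the offending values for each flagged column.
for col in special_columns:
    print(col, df.loc[mask[col], col].tolist())
```

With the sample data, only col2 should be reported, together with both of its values.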

Thanks, you saved me. How can I display the column values that contain special characters? — Learnings, Jul 13 '18 at 06:11
How do I remove them? I need to strip the special characters. ~df.col2.str.strip(alphabet) is not working. — Learnings, Jul 13 '18 at 07:24

Plinus · Answer 2 · 2018-07-16T12:41:26.070

7

To remove unwanted characters from dataframe columns, use regex:

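import re

# Matches anything outside the listed characters. The hyphen is escaped:
# unescaped, `+-=` is a range that also matches digits and `,./:;<`.
disallowed = re.compile(r'[^a-zA-Z !@#$%&*_+\-=|:";<>,./()[\]{}\']')

def strip_character(value):
    return disallowed.sub('', value)

df[resultCol] = df[dataCol].apply(strip_character)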

answered Jul 16 '18 at 12:38 · edited Jul 16 '18 at 12:41 · Plinus

score 1 · Answer 3 · edited Oct 16 '21 at 20:26

# Whitespace may also need to be treated as unwanted in some cases.

import string

unwanted = string.ascii_letters + string.punctuation + string.whitespace
print(unwanted)

# This helped me extract '10' from '10+ years'.

df.col = df.col.str.strip(unwanted)
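One thing to keep in mind is that `str.strip` only trims characters from the two ends of each value. The sketch below, using made-up sample values, contrasts it with a regex replacement that removes unwanted characters everywhere; keeping only digits is an assumption chosen to match the '10+ years' case.

```python
import string

import pandas as pd

unwanted = string.ascii_letters + string.punctuation + string.whitespace

s = pd.Series(['10+ years', '  about 5 yrs.', '3 to 5 years'])

# strip() trims unwanted characters from both ends only,
# so text between two digits is left untouched.
stripped = s.str.strip(unwanted)
print(stripped.tolist())

# A regex replacement drops every non-digit, wherever it occurs.
digits_only = s.str.replace(r'[^0-9]', '', regex=True)
print(digits_only.tolist())
```

Which of the two is appropriate depends on whether characters in the middle of a value should survive.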

edited Oct 16 '21 at 20:26 · pzaenger
answered Oct 16 '21 at 20:05 · Indrajit Bhatkar
