Delete specific symbols (unicode) from Pandas DataFrame Column

Question

I have a pandas DataFrame:

data1 = pandas.DataFrame(['привет, Вася', 'как дела?', 'уиии!!'])

As you can see, it contains Unicode (Cyrillic) characters:

>>> data1
              0
0  привет, Вася
1     как дела?
2        уиии!!

I am trying to delete certain symbols from a DataFrame column. I tried:

data1.apply(replace ???)
data1[0].replace()

and even something with a lambda, but I don't know how to call replace correctly. I would like to specify the symbols to delete either as an explicit list or as a range of code points:

x in '!@#$%^&*()'

or

if ord(x) not in range(1040, 1104)  # ord() of Cyrillic А..я

What is your expected output? Anything that is not cyrillic? — cs95, Dec 15 '17 at 12:24
@cᴏʟᴅsᴘᴇᴇᴅ, no, I want the following: `['привет Вася', 'как дела', 'уиии']`. Delete all special symbols (such as !@#$'"). And I want to specify them in some usable form (for example, using `range`) — Mikhail_Sam, Dec 15 '17 at 12:26
@Mikhail_Sam, could you define `specific symbols` or better yet post your desired output? — MaxU - stand with Ukraine, Dec 15 '17 at 12:28
@MaxU yes, I want to define them in some usable form. For example: `if ord(x) in range(0, 65) or ord(x) in range(91, 97)`, then delete them — Mikhail_Sam, Dec 15 '17 at 12:30
@cᴏʟᴅsᴘᴇᴇᴅ Sorry, that was missclick enter :) I edited previous comment — Mikhail_Sam, Dec 15 '17 at 12:30

MaxU - stand with Ukraine · Accepted Answer · 2017-12-15T12:56:48.597

6

You can use a Unicode-aware RegEx via the (?u) flag:

Source DF:

In [30]: df
Out[30]:
                        col
0              привет, Вася
1                 как дела?
2              уиии 23 45!!
3  давай Вася, до свидания!

Solution (removing all digits, all trailing whitespace, and every character that is not a word character, whitespace, or a question mark):

In [36]: df.replace([r'\d+', r'(?u)[^\w\s\?]+', r'\s*$'], ['', '', ''], regex=True)
Out[36]:
                      col
0             привет Вася
1               как дела?
2                    уиии
3  давай Вася до свидания

RegEx explained ...
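To make the three patterns easier to follow, here is a minimal sketch that applies them outside pandas with the standard re module. It assumes the patterns are applied one after another in the order listed, which is consistent with the output shown above; the helper name clean is hypothetical and not part of the answer.

```python
import re

# The three patterns from the answer, in the same order.
PATTERNS = [
    r'\d+',             # one or more digits
    r'(?u)[^\w\s\?]+',  # runs of anything that is not a word character, whitespace or '?'
    r'\s*$',            # trailing whitespace
]

def clean(text):
    """Hypothetical helper: replace every match of each pattern with ''."""
    for pattern in PATTERNS:
        text = re.sub(pattern, '', text)
    return text

rows = ['привет, Вася', 'как дела?', 'уиии 23 45!!', 'давай Вася, до свидания!']
cleaned = [clean(row) for row in rows]
print(cleaned)
```

The digits go first, which leaves stray spaces behind in the third row; the final pattern then trims them.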

edited Dec 15 '17 at 12:56

answered Dec 15 '17 at 12:38


Magnificent! Can you please clarify what `(?u)` and `\w\s` mean? And how do I delete all special symbols except `?`? – Mikhail_Sam Dec 15 '17 at 12:42
`(?u)` is new to me – Bharath M Shetty Dec 15 '17 at 12:43
just last question: how to delete numbers 0-9 too? – Mikhail_Sam Dec 15 '17 at 12:51
@Mikhail_Sam, i've extended my answer - please check... – MaxU - stand with Ukraine Dec 15 '17 at 12:58
@Mikhail_Sam, you are welcome! :) – MaxU - stand with Ukraine Dec 15 '17 at 13:13

cs95 · Answer 2 · 2017-12-15T12:48:57.850

5

Okay, IIUC, use string.punctuation and perform replacement with replace -

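import re
import string
# re.escape keeps characters such as ], \ and ^ literal inside the character class
data1.replace(r'[{}]'.format(re.escape(string.punctuation)), '', regex=True)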

             0
0  привет Вася
1     как дела
2         уиии

Where,

string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

If you want to exclude a particular character / set of chars, here's one way to do it, using set.difference -

c = set(string.punctuation)
p_to_exclude = ['?', ...]

c = c.difference(p_to_exclude)

Now, you can use c as before -

data1.replace(r'[{}]'.format(re.escape(''.join(c))), '', regex=True)
             0
0  привет Вася
1    как дела?
2         уиии

Note the use of re.escape here: characters such as ], \ and ^ are metacharacters inside a character class and need to be escaped. Because a set is unordered, they can end up anywhere in the joined string.
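The exclusion idea can be wrapped up in a small function. The following is only a sketch using the standard library; the name punctuation_pattern and its keep parameter are hypothetical, and the characters are sorted so that the resulting pattern is the same on every run.

```python
import re
import string

def punctuation_pattern(keep=''):
    """Hypothetical helper: character class matching ASCII punctuation except `keep`."""
    chars = sorted(set(string.punctuation) - set(keep))
    # re.escape makes ], \, ^ and - safe inside the square brackets.
    return '[{}]'.format(re.escape(''.join(chars)))

pattern = punctuation_pattern(keep='?')
cleaned = [re.sub(pattern, '', s) for s in ['привет, Вася', 'как дела?', 'уиии!!']]
print(cleaned)
```

The same pattern string can be passed to pandas, for example data1.replace(pattern, '', regex=True).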

edited Dec 15 '17 at 12:48

answered Dec 15 '17 at 12:30


Awesome! So can I use `data1[0].replace` to replace just one column? Can you please clarify your answer a bit? What does `r''` mean? And one more detail: what if I want to delete all punctuation except `?`, for example? – Mikhail_Sam Dec 15 '17 at 12:34
@Mikhail_Sam “r” marks a raw string. If you want to perform the replacement on one column, you can index it with iloc or [..] and then call replace or str.replace. Also, if you want to exclude a particular character, you can use str.replace on the punctuation string. – cs95 Dec 15 '17 at 12:42
It’s considered good practice to use raw strings, especially with regex. – cs95 Dec 15 '17 at 12:44
@Mikhail_Sam Okay, see my edit. – cs95 Dec 15 '17 at 12:46
Thank you for so detailed answer! – Mikhail_Sam Dec 15 '17 at 13:13

Bharath M Shetty · Answer 3 · 2017-12-15T12:32:19.237

3

Perhaps you're looking for substitution. The character class [!@...] is equivalent to (! or @ ...), i.e.:

data1[0].str.replace(r'[!@#$%^&*()]', '', regex=True)

0    привет, Вася
1       как дела?
2            уиии
Name: 0, dtype: object

If you want to replace the punctuation across the whole DataFrame, use:

data1.replace(r'[!@#$%^&*()]', '', regex=True)

Based on the comment, the regex you might be looking for is:

data1.replace(r'[^\w\s]', '', regex=True)
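As a quick check of what this pattern does to Cyrillic text, here is a small sketch using the standard re module. It assumes Python 3, where \w in a str pattern already matches Unicode letters, digits and the underscore. The second pattern is an illustration of the code point range idea from the question rather than something from this answer; the range U+0410 to U+044F covers А..я but not ё/Ё.

```python
import re

samples = ['привет, Вася', 'как дела?', 'уиии 23 45!!']

# Remove everything that is neither a word character nor whitespace.
no_punct = [re.sub(r'[^\w\s]', '', s) for s in samples]

# Keep only Cyrillic А..я (U+0410..U+044F) and whitespace, then trim the ends.
cyrillic_only = [re.sub(r'[^\u0410-\u044f\s]', '', s).strip() for s in samples]

print(no_punct)
print(cyrillic_only)
```

The first pattern keeps the digits, since digits count as word characters; the range-based pattern drops them along with the punctuation.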

edited Dec 15 '17 at 12:32

answered Dec 15 '17 at 12:25


Nice, but `str.replace` works on one column at a time. I'm guessing you'd want `replace`? – cs95 Dec 15 '17 at 12:31
I don't think one would have string data all over the DataFrame. – Bharath M Shetty Dec 15 '17 at 12:33
Yep you are right I need to work on just several columns. Can you please clarify, what `\w\s` mean? And in first your method - what if I want to delete `[` and `]` symbols too? – Mikhail_Sam Dec 15 '17 at 12:39
Use the escape character, like `\[`. Inside the brackets a leading `^` means "not", `\w` matches letters, digits and the underscore, and `\s` matches whitespace, so I'm removing everything other than `\w` and `\s`. – Bharath M Shetty Dec 15 '17 at 12:40
Thank you for clarification! One more question, if you let: how to delete numbers 0-9 too? – Mikhail_Sam Dec 15 '17 at 12:53
@Mikhail_Sam I forgot you have characters other than the English alphabet. MaxU already edited his answer for that. – Bharath M Shetty Dec 15 '17 at 13:07
Thank you for help! – Mikhail_Sam Dec 15 '17 at 13:13
