3

I have a pandas DataFrame:

data1 = pandas.DataFrame(['привет, Вася', 'как дела?', 'уиии!!'])

As you can see, it contains Unicode (Cyrillic) characters:

>>> data1
              0
0  привет, Вася
1     как дела?
2        уиии!!

I am trying to delete certain characters from a DataFrame column. I tried:

data1.apply(replace ???)
data1[0].replace()

and even something with a lambda, but I don't know how to call replace correctly. To illustrate, the characters to delete could be selected by membership in a set:

x in '!@#$%^&*()'

or

if ord(x) not in range(1040,1072) # ord() of Cyrillic letters
Mikhail_Sam

3 Answers

6

You can use a Unicode-aware regex, (?u):

Source DF:

In [30]: df
Out[30]:
                        col
0              привет, Вася
1                 как дела?
2              уиии 23 45!!
3  давай Вася, до свидания!

Solution (removing all digits, all trailing whitespace, and all non-word characters except whitespace and the question mark):

In [36]: df.replace([r'\d+', r'(?u)[^\w\s\?]+', r'\s*$'], ['', '', ''], regex=True)
Out[36]:
                      col
0             привет Вася
1               как дела?
2                    уиии
3  давай Вася до свидания

RegEx explained ...
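To see what each of the three patterns contributes, here is a minimal sketch that applies them one after another with the standard re module instead of pandas. The helper name clean is hypothetical and only for illustration. In Python 3 the (?u) flag is already the default for str patterns, so it is kept here only to mirror the answer.

```python
import re

# The three patterns from the answer, applied in the same order.
PATTERNS = [
    r'\d+',             # runs of digits
    r'(?u)[^\w\s\?]+',  # anything that is not a word character, whitespace or '?'
    r'\s*$',            # trailing whitespace
]

def clean(text):
    """Hypothetical helper: strip each pattern from the text in turn."""
    for pattern in PATTERNS:
        text = re.sub(pattern, '', text)
    return text

samples = ['привет, Вася', 'как дела?', 'уиии 23 45!!', 'давай Вася, до свидания!']
cleaned = [clean(s) for s in samples]
print(cleaned)
```

The trailing-whitespace pattern comes last because removing the digits and the exclamation marks from the third row leaves spaces behind.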

MaxU - stand with Ukraine
5

If I understand correctly, you can use string.punctuation and perform the replacement with replace:

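import re
import string

# re.escape keeps characters such as \ ] ^ - literal inside the character class
data1.replace(r'[{}]'.format(re.escape(string.punctuation)), '', regex=True)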

             0
0  привет Вася
1     как дела
2         уиии 

Where,

string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

If you want to keep a particular character or set of characters (that is, exclude it from removal), here's one way to do it, using set.difference:

c = set(string.punctuation)
p_to_exclude = ['?', ...]

c = c.difference(p_to_exclude)

Now you can use c as before:

data1.replace(r'[{}]'.format(re.escape(''.join(c))), '', regex=True)
             0
0  привет Вася
1    как дела?
2         уиии

Note the use of re.escape here: characters such as [, ], \, ^ and - are metacharacters inside a character class and need to be escaped.
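The set-difference approach above can be sketched as a small reusable function. This uses plain re on strings rather than pandas, and the name build_punct_pattern and its keep parameter are assumptions made for illustration. Sorting the characters makes the generated pattern deterministic, since set iteration order is arbitrary.

```python
import re
import string

def build_punct_pattern(keep=''):
    """Hypothetical helper: compile a character class that matches
    ASCII punctuation, minus the characters listed in `keep`."""
    chars = sorted(set(string.punctuation) - set(keep))
    # re.escape protects metacharacters such as ], \, ^ and - inside [...]
    return re.compile('[{}]'.format(re.escape(''.join(chars))))

pattern = build_punct_pattern(keep='?')
result = [pattern.sub('', s) for s in ['привет, Вася', 'как дела?', 'уиии!!']]
print(result)
```

The resulting pattern string (pattern.pattern) should also be usable as the first argument of data1.replace(..., regex=True).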

cs95
  • Awesome! So can I use `data1[0].replace` to replace just one column? Could you please clarify your answer a bit: what does `r''` mean? And one more detail: what if I want to delete all punctuation except `?`, for example? – Mikhail_Sam Dec 15 '17 at 12:34
  • @Mikhail_Sam “r” marks a raw string. If you want to perform the replacement on one column, you can index it with iloc or [..] and then call replace or str.replace. Also, if you want to exclude a particular character, you can use str.replace on the punctuation string. – cs95 Dec 15 '17 at 12:42
  • It’s considered good practice to use raw strings, especially with regex. – cs95 Dec 15 '17 at 12:44
  • @Mikhail_Sam Okay, see my edit. – cs95 Dec 15 '17 at 12:46
  • Thank you for such a detailed answer! – Mikhail_Sam Dec 15 '17 at 13:13
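As a rough illustration of the single-column case discussed in the comments, the sketch below cleans only column 0 in two ways. The pattern [!,] is just an example choice. Passing regex=True explicitly is a deliberate precaution, because the default of Series.str.replace has differed between pandas versions.

```python
import pandas as pd

data1 = pd.DataFrame(['привет, Вася', 'как дела?', 'уиии!!'])

# Option 1: the .str accessor on a single column (a Series).
via_str = data1[0].str.replace(r'[!,]', '', regex=True)

# Option 2: Series.replace with regex=True also substitutes inside each string.
via_replace = data1[0].replace(r'[!,]', '', regex=True)

# Write the cleaned values back to the column.
data1[0] = via_str
print(data1)
```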
3

Perhaps you're looking for substitution. The character class [!@...] matches any one of the listed characters (! or @ or ...), i.e.:

data1[0].str.replace(r'[!@#$%^&*()]', '', regex=True)

0    привет, Вася
1       как дела?
2            уиии
Name: 0, dtype: object

If you want to replace the punctuation across the whole DataFrame, then go for:

data1.replace(r'[!@#$%^&*()]', '', regex=True)

Based on the comment, the regex you might be looking for is:

data1.replace(r'[^\w\s]', '', regex=True)
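For comparison, here is a small sketch (plain re, not pandas) of what [^\w\s] keeps, next to a whitelist that keeps only Cyrillic letters and whitespace, along the lines of the code-point idea in the question. The whitelist range is an assumption: U+0410 to U+044F (decimal 1040 to 1103) covers А-я, with Ё and ё added separately. The sample strings are made up for illustration.

```python
import re

samples = ['привет, Вася', 'как дела?', 'уиии 23_45!!']

# \w matches letters in any script, digits and the underscore,
# so [^\w\s] removes punctuation but keeps '23_45'.
no_punct = [re.sub(r'[^\w\s]', '', s) for s in samples]

# Whitelist: drop everything except basic Cyrillic letters and whitespace.
CYRILLIC_ONLY = r'[^\u0410-\u044F\u0401\u0451\s]'
cyrillic_only = [re.sub(CYRILLIC_ONLY, '', s) for s in samples]

print(no_punct)
print(cyrillic_only)
```

The whitelist leaves a trailing space in the last row, which a pattern such as \s*$ could remove.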
Bharath M Shetty