Remove special characters in pandas dataframe

Question

This seems like an inherently simple task but I am finding it very difficult to remove the '' from my entire data frame and return the numeric values in each column, including the numbers that did not have ''. The dateframe includes hundreds of more columns and looks like this in short:

Time            A1      A2
2.0002546296    1499    1592
2.0006712963    1252    1459
2.0902546296    1731    2223
2.0906828704    1691    1904
2.1742245370    2364    3121
2.1764699074    2096    1942
2.7654050926    *7639*  *8196*
2.7658564815    *7088*  *7542*
2.9048958333    *8736*  *8459*
2.9053125000    *7778*  *7704*
2.9807175926    *6612*  *6593*
3.0585763889    *8520*  *9122*

I have not written it to iterate over every column in df yet but as far as the first column goes I have come up with this

df['A1'].str.replace('*','').astype(float)

which yields

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
5        NaN
6        NaN
7        NaN
8        NaN
9        NaN
10       NaN
11       NaN
12       NaN
13       NaN
14       NaN
15       NaN
16       NaN
17       NaN
18       NaN
19    7639.0
20    7088.0
21    8736.0
22    7778.0
23    6612.0
24    8520.0

Is there a very easy way to just remove the '*' in the dataframe in pandas?

shivsn · Answer 1 · 2016-07-09T17:56:27.080

19

use replace which applies on whole dataframe :

df
Out[14]: 
       Time      A1      A2
0  2.000255    1499    1592
1  2.176470    2096    1942
2  2.765405  *7639*  *8196*
3  2.765856  *7088*  *7542*
4  2.904896  *8736*  *8459*
5  2.905312  *7778*  *7704*
6  2.980718  *6612*  *6593*
7  3.058576  *8520*  *9122*

df=df.replace('\*','',regex=True).astype(float)

df
Out[16]: 
       Time    A1    A2
0  2.000255  1499  1592
1  2.176470  2096  1942
2  2.765405  7639  8196
3  2.765856  7088  7542
4  2.904896  8736  8459
5  2.905312  7778  7704
6  2.980718  6612  6593
7  3.058576  8520  9122

edited Jul 09 '16 at 17:56

answered Jul 09 '16 at 03:17

shivsn

7,680
1
26
33

4

Add `df = df.astype(float)` after the replace and you've got it. I'd skip `inplace` and just do `df = df.replace('\*', '', regex=True).astype(float)` and call it good. – piRSquared Jul 09 '16 at 07:51
1

@piRSquared is it a bad practice using `inplace`. – shivsn Jul 09 '16 at 17:55
1

No. Just preference. In this case, when you used inplace it did not return itself allowing for chaining. Without it, I was able to chain astype(float) – piRSquared Jul 09 '16 at 18:45

score 4 · Answer 2 · edited May 04 '23 at 23:05

4

I found this to be a simple approach - Use replace to retain only the digits (and dot and minus sign).

This would remove characters, alphabets or anything that is not defined in to_replace attribute.

So, the solution is:

df['A1'].replace(regex=True, inplace=True, to_replace=r'[^0-9.\-]', value=r'')
df['A1'] = df['A1'].astype(float64)

edited May 04 '23 at 23:05

InSync

4,851
4
8
30

answered Mar 28 '18 at 14:18

CuriousCoder

491
5
9

score 3 · Answer 3 · edited May 23 '17 at 11:54

There is another solution which uses map and strip functions. You can see the below link: Pandas DataFrame: remove unwanted parts from strings in a column.

df = 
    Time     A1     A2
0   2.0     1258    *1364*
1   2.1     *1254*  2002
2   2.2     1520    3364
3   2.3     *300*   *10056*

cols = ['A1', 'A2']
for col in cols:
    df[col] = df[col].map(lambda x: str(x).lstrip('*').rstrip('*')).astype(float)

df = 
    Time     A1     A2
0   2.0     1258    1364
1   2.1     1254    2002
2   2.2     1520    3364
3   2.3     300     10056

The parsing procedure only be applied on the desired columns.

score 3 · Answer 4 · edited Jun 27 '19 at 09:15

3

I found the answer of CuriousCoder so brief and useful but there must be a ')' instead of ']' So it should be:

df['A1'].replace(regex=True, inplace=True, to_replace=r'[^0-9.\-]',
value=r''] df['A1'] = df['A1'].astype(float64)

edited Jun 27 '19 at 09:15

PV8

5,799
7
43
87

answered Jun 27 '19 at 09:01

Đoàn Phương Thảo

31
1

2

The [answer by CuriousCoder](https://stackoverflow.com/a/49536987/2745495) was already edited to fix the typo. So this post is not anymore needed. In the future, please suggest an edit instead of posting a separate answer. – Gino Mempin May 05 '23 at 01:59
Well this answer was provided before the *type* was fixed and it's still an attempt to solve the question. With the slightest punctuation/digit/comma missing, anything and specially `regex` could be out of air. I would like to encourage the user and keep this answer. – bonCodigo May 07 '23 at 05:38

Remove special characters in pandas dataframe

4 Answers4

Linked