Remove unwanted characters from string using pandas

Question

I have the following dataframe:

import pandas as pd

df = pd.DataFrame({'A': ['2.5cm','2.5cm','2.56”','1.38”','2.2”','0.8 in','$18.00','4','2"']})

which looks like:

  A
2.5cm
2.5cm
2.56”
1.38”
2.2”
0.8 in
$18.00
4
2"

I want to remove every character except the digits and the decimal point.

The output should be:

 A
2.5
2.5
2.56
1.38
2.2
0.8
18.00
4
2

Here is what I've tried:

df['A'] = df.A.str.replace(r"[a-zA-Z]", '', regex=True)
df['A'] = df.A.str.replace(r'\W', '', regex=True)

but this is stripping out everything including the decimal point.

Any suggestions would be greatly appreciated.

Thank you in advance

How about adapting your approach: `df.A.str.replace(r"[^\d.]", "", regex=True)`, which removes every character that is neither a digit nor a dot? – Jan 24 '22 at 19:26
The `^` at the start of the character class negates it, so any character that is neither a digit nor a literal dot gets removed. https://regex101.com/r/eIYEin/1 – Jan 24 '22 at 19:29
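
As a rough illustration of the negated character class suggested in the comments, here is a minimal sketch run against the question's sample data. The variable name `cleaned` is only illustrative, and `regex=True` is passed explicitly on the assumption that the installed pandas version may treat the pattern as a literal string by default.

```python
import pandas as pd

# Sample data from the question (curly and straight inch marks included).
df = pd.DataFrame({'A': ['2.5cm', '2.5cm', '2.56”', '1.38”', '2.2”',
                         '0.8 in', '$18.00', '4', '2"']})

# [^\d.] matches any single character that is neither a digit nor a dot.
# regex=True is explicit so the pattern is not taken as a literal string.
cleaned = df['A'].str.replace(r'[^\d.]', '', regex=True)

print(cleaned.tolist())
```

Note that this keeps every dot in the string, so a value with a stray period (for example an abbreviation before the number) would leave a malformed result; the str.extract answers below avoid that by matching the number itself.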

score 2 · Accepted Answer · 2022-01-24T23:04:57.823

You can use str.extract to extract only the floating-point numbers:

df['A'] = df['A'].astype(str).str.extract(r'(\d+.\d+|\d)').astype('float')

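However, the unescaped '.' here matches any character, not just the period, so the pattern would match all of 18,00 instead of stopping at 18. The bare \d alternative also matches only a single digit, so multi-digit whole numbers are not extracted completely. Use the code below instead (thanks @DYZ):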

df['A'] = df['A'].astype(str).str.extract(r'(\d+\.\d+|\d+)').astype('float')
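
To make the difference between the two patterns concrete, here is a small sketch using Python's re module, whose regex syntax is the same one these pandas string methods accept. The helper `first_match` and the sample strings are made up for illustration and are not part of the original answer.

```python
import re

loose = re.compile(r'(\d+.\d+|\d)')     # first attempt: unescaped dot, single \d
strict = re.compile(r'(\d+\.\d+|\d+)')  # corrected: literal dot, \d+ for integers


def first_match(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


samples = ['18,00', '42', '2.5cm']
loose_results = [first_match(loose, s) for s in samples]
strict_results = [first_match(strict, s) for s in samples]

print(loose_results)   # ['18,00', '4', '2.5']
print(strict_results)  # ['18', '42', '2.5']
```

With the first pattern, '18,00' is captured whole (and would then fail the float conversion) and '42' is cut down to '4'; the corrected pattern returns '18' and '42'.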

score 1 · Answer 2 · answered Jan 24 '22 at 19:08

Try str.extract:

df['new'] = df.A.str.extract(r'(\d*\.\d+|\d+)').astype(float).iloc[:,0]
Out[31]: 
       0
0   2.50
1   2.50
2   2.56
3   1.38
4   2.20
5   0.80
6  18.00

answered Jan 24 '22 at 19:08

BENY

Or `'(\d*\.\d+|\d+\.?)'` to match numbers like 34. (with a period but no fractional part). – DYZ Jan 24 '22 at 22:41
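
A quick sketch of how that variant behaves, assuming a few made-up inputs (the Series `s` below is hypothetical and not from the question):

```python
import pandas as pd

# Hypothetical inputs: a trailing period, a leading period, a curly inch
# mark, and a plain integer.
s = pd.Series(['34.', '.5in', '2.56”', '4'])

# First alternative: optional integer part, a dot, then required digits.
# Second alternative: digits with an optional trailing dot.
pattern = r'(\d*\.\d+|\d+\.?)'

# str.extract returns a DataFrame with one column per group; take column 0.
out = s.str.extract(pattern)[0].astype(float)

print(out.tolist())  # [34.0, 0.5, 2.56, 4.0]
```

The trailing period in '34.' is captured along with the digits, and the float conversion still accepts it.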
