I have the following dataframe:

df = pd.DataFrame({'A': ['2.5cm','2.5cm','2.56”','1.38”','2.2”','0.8 in','$18.00','4','2"']})

which looks like:

  A
2.5cm
2.5cm
2.56”
1.38”
2.2”
0.8 in
$18.00
4
2"

I want to remove every character except the digits and the decimal point.

The output should be:

 A
2.5
2.5
2.56
1.38
2.2
0.8
18.00
4
2

Here is what I've tried:

df['A'] = df.A.str.replace(r"[a-zA-Z]", '')
df['A'] = df.A.str.replace('\W', '')

but this is stripping out everything including the decimal point.

Any suggestions would be greatly appreciated.

Thank you in advance

ApacheOne
  • What about following your way: `df.A.str.replace(r"[^\d.]", "")`: removing any non-digits except for the dot? –  Jan 24 '22 at 19:26
  • The `^` in the regex negates what's inside; so if it's not a digit nor a literal dot, we remove it. https://regex101.com/r/eIYEin/1. –  Jan 24 '22 at 19:29
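The negated character class suggested in the comments above can be sketched as follows. This is a minimal example, assuming the question's dataframe; `regex=True` is passed explicitly because pandas 2.0 changed the default of `str.replace` to `regex=False`, and the `cleaned` variable name is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['2.5cm', '2.5cm', '2.56”', '1.38”', '2.2”',
                         '0.8 in', '$18.00', '4', '2"']})

# [^\d.] matches any character that is neither a digit nor a literal dot,
# so replacing it with '' keeps only digits and dots.
# regex=True is explicit: since pandas 2.0 the pattern is otherwise literal.
cleaned = df['A'].str.replace(r'[^\d.]', '', regex=True)

# Convert the remaining numeric strings to floats.
df['A'] = cleaned.astype(float)
print(df)
```

Note that this approach keeps every dot in the string, so a value such as `v1.2.3` would become `1.2.3` and fail the float conversion; it fits data like the sample, where each cell holds a single number.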

2 Answers

You can use str.extract to extract only the floating-point numbers:

df['A'] = df['A'].astype(str).str.extract(r'(\d+.\d+|\d)').astype('float')

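However, the unescaped '.' here matches any character, not just the period, so it would match 18,00 in full instead of stopping at 18. The lone \d in the second alternative also matches only a single digit, so multi-digit whole numbers are cut short (42 would be extracted as 4). Use the code below instead (thanks @DYZ):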

df['A'] = df['A'].astype(str).str.extract(r'(\d+\.\d+|\d+)').astype('float')

Output:

       A
0   2.50
1   2.50
2   2.56
3   1.38
4   2.20
5   0.80
6  18.00
7   4.00
8   2.00
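To see the difference between the two patterns in isolation, here is a small sketch using the standard `re` module rather than pandas. The names `loose`, `strict`, and `results` and the test strings are illustrative, not part of the original answer.

```python
import re

# First attempt: unescaped dot and a single-digit fallback.
loose = re.compile(r'\d+.\d+|\d')
# Corrected pattern: literal dot and a multi-digit fallback.
strict = re.compile(r'\d+\.\d+|\d+')

results = {}
for text in ['18,00', '42', '$18.00']:
    # search() returns the first match found scanning left to right.
    results[text] = (loose.search(text).group(), strict.search(text).group())
    print(text, results[text])
```

With the loose pattern, `18,00` is matched whole because `.` accepts the comma, and `42` is reduced to `4`; the strict pattern yields `18` and `42`. Both agree on `$18.00`.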

Try with str.extract

df['new'] = df.A.str.extract(r'(\d*\.\d+|\d+)').astype(float).iloc[:,0]
Out[31]: 
       0
0   2.50
1   2.50
2   2.56
3   1.38
4   2.20
5   0.80
6  18.00
7   4.00
8   2.00
BENY
  • Or `'(\d*\.\d+|\d+\.?)'` to match numbers like 34. (with a period but no fractional part). – DYZ Jan 24 '22 at 22:41
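The variant from the comment above can be checked with a short sketch, again using the standard `re` module. The sample strings and the `pattern`, `samples`, `matches`, and `values` names are assumptions made for illustration.

```python
import re

# \d*\.\d+ covers numbers with a fractional part (including ".5");
# \d+\.? covers whole numbers, optionally followed by a trailing period.
pattern = re.compile(r'\d*\.\d+|\d+\.?')

samples = ['34.', '.5in', '2.56”', '4']
matches = [pattern.search(s).group() for s in samples]

# float() accepts both "34." and ".5", so the matches convert directly.
values = [float(m) for m in matches]
print(matches)
print(values)
```

Whether a trailing period should count as part of the number depends on the data: in free text such as "costs 34." it would be captured too, although the float conversion still gives 34.0.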