How can I remove all non-numeric characters from all the values in a particular column in pandas dataframe?

Question

I have a dataframe which looks like this:

     A       B           C
1   red78   square    big235
2   green   circle    small123
3   blue45  triangle  big657

I need to be able to remove the non-numeric characters from all the rows in column C so that my dataframe looks like:

     A       B           C
1   red78   square    235
2   green   circle    123
3   blue45  triangle  657

I tried using the following but get the error expected string or buffer:

import re
dfOutput.imgID = dfOutput.imgID.apply(re.sub('[^0-9]','', dfOutput.imgID), axis = 0)

What should I do instead?

Code to create dataframe:

dfObject = pd.DataFrame()
dfObject.set_value(1, 'A', 'red78')
dfObject.set_value(1, 'B', 'square')
dfObject.set_value(1, 'C', 'big235')
dfObject.set_value(2, 'A', 'green')
dfObject.set_value(2, 'B', 'circle')
dfObject.set_value(2, 'C', 'small123')
dfObject.set_value(3, 'A', 'blue45')
dfObject.set_value(3, 'B', 'triangle')
dfObject.set_value(3, 'C', 'big657')

score 51 · Answer 1 · answered May 22 '17 at 16:02

51

Use str.extract and pass a regex pattern to extract just the numeric parts:

In[40]:
dfObject['C'] = dfObject['C'].str.extract('(\d+)', expand=False)
dfObject

Out[40]: 
        A         B    C
1   red78    square  235
2   green    circle  123
3  blue45  triangle  657

If needed you can cast to int:

dfObject['C'] = dfObject['C'].astype(int)

answered May 22 '17 at 16:02

EdChum

376,765
198
813
562

17

`extract`, used this way, will only extract one sequence of digit characters from the input fields. If there are multiple numeric parts separated by non-numeric characters, this answer will only extract the first sequence of characters. – meta4 Dec 31 '17 at 15:56
This gives me `E:\WPy-3662\python-3.6.6.amd64\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy """Entry point for launching an IPython kernel.` in Python 3.6.6 – Superdooperhero Feb 07 '19 at 17:35
1

@Superdooperhero it means that your df is a view or slice of the original df hence the warning, I can't comment without seeing your full code and how you caused this. Basically if you want to operate on a copy then you do `df1 = df.copy()` then whatever you do with `df1` generates no warning, but if you really want to operate on the original then you need to use `.loc` so it's explicit that you want to operate on a view, see related: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas – EdChum Feb 07 '19 at 17:47
Thanks. That answer fixes the problem. – Superdooperhero Feb 07 '19 at 17:51
`extract('(\d+)', expand=False)` seems to also loose the decimal point for floating point numbers. Is there any way to prevent that? – Superdooperhero Feb 07 '19 at 18:00
2

@Superdooperhero try regex pattern `'(\d+\.*\d*)'` or just use the other answer to replace the characters with empty strings and all that's left will be numeric characters – EdChum Feb 07 '19 at 18:02

score 28 · Answer 2 · answered Oct 01 '19 at 07:36

28

To remove all non-digit characters from strings in a Pandas column you should use str.replace with \D+ or [^0-9]+ patterns:

dfObject['C'] = dfObject['C'].str.replace(r'\D+', '')

Or, since in Python 3, \D is fully Unicode-aware by default and thus does not match non-ASCII digits (like ۱۲۳۴۵۶۷۸۹, see proof) you should consider

dfObject['C'] = dfObject['C'].str.replace(r'[^0-9]+', '')

So,

import re
print ( re.sub( r'\D+', '', '1۱۲۳۴۵۶۷۸۹0') )         # => 1۱۲۳۴۵۶۷۸۹0
print ( re.sub( r'[^0-9]+', '', '1۱۲۳۴۵۶۷۸۹0') )     # => 10

answered Oct 01 '19 at 07:36

Wiktor Stribiżew

607,720
39
448
563

3

arabic/hindi numerals, nice one ;) – Umar.H Feb 14 '20 at 17:51
My favourite so far (+1). – Tiago Martins Peres Jun 13 '20 at 08:08
Most comprehensive solution (+1) – Narin Dhatwalia May 08 '21 at 20:15

score 18 · Answer 3 · answered May 22 '17 at 16:06

18

You can use .str.replace with a regex:

dfObject['C'] = dfObject.C.str.replace(r"[a-zA-Z]",'')

output:

        A         B    C
1   red78    square  235
2   green    circle  123
3  blue45  triangle  657

answered May 22 '17 at 16:06

Scott Boston

147,308
15
139
187

7

Close. This answer will remove all alphanumeric characters. To remove all non-numeric characters use the `\D` character class – meta4 Dec 31 '17 at 15:51

score 11 · Answer 4 · answered Apr 01 '18 at 17:53

You can also do this via a lambda function with str.isdigit:

import pandas as pd

df = pd.DataFrame({'Name': ['John5', 'Tom 8', 'Ron 722']})

df['Name'] = df['Name'].map(lambda x: ''.join([i for i in x if i.isdigit()]))

#   Name
# 0    5
# 1    8
# 2  722

score 3 · Answer 5 · answered Jun 24 '19 at 12:07

After 2 years, to help others, I actually think that you were very close to the answer. I have used your logic but made it work. basically you create a function that does the clean up and then apply it to the column C.

import pandas as pd
import re

df = pd.DataFrame({
     'A': ['red78', 'green', 'blue45'],
     'B': ['square', 'circle', 'triangle'],
    'C': ['big235', 'small123',  'big657']
})

def remove_chars(s):
    return re.sub('[^0-9]+', '', s) 

df['C'] = df['C'].apply(remove_chars)
df

Result below:

A   B   C
0   red78   square  235
1   green   circle  123
2   blue45  triangle    657

score 0 · Answer 6 · answered Apr 27 '22 at 14:26

Suppose we want to remove all characters from all the columns in one.

import pandas as pd
 

df = pd.DataFrame({
     'A': ['red78', 'green', 'blue45'],
     'B': ['square', 'circle', 'triangle'],
    'C': ['big235', 'small123',  'big657']
})
 df[['A', 'C']] = df[['A', 'C']].replace(regex=[r'\D+'], value="") 

Output:
   A         B    C
0  78    square  235
1        circle
2  45  triangle  657

How can I remove all non-numeric characters from all the values in a particular column in pandas dataframe?

6 Answers6

Linked

Related