Unable to strip non-numeric values from column in Pandas dataframe

Question

I’m cleaning and exploring (EDA) a time series dataset of revenues. Some of the values are prefixed with ‘(R) ’, meaning the value has been revised, and are displayed like (R) 1000. Example:

df = pd.DataFrame({
    'year': ['2005', '2006', '2007'], 
    'revenue': [500, (R) 1000, 2200]})

Strangely, the data type for this column still shows as float64, and the column plots fine as a line plot. In the original Excel spreadsheet, when I select one of these cells, the (R) disappears and only the numerical value is displayed.

I tried the following code:

df['revenue'] = df['revenue'].replace('(R) ','', regex=True)

This code does not raise any errors, but the (R) is still there when I look at the dataframe. The (R) seems to act as some kind of placeholder, but I cannot figure out how to remove it, and it conflicts with my other data.

Basically, I just want to change values such as (R) 1000 to 1000.

Please post some examples with expected output. https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — Scott Boston, Aug 16 '22 at 17:25
@NigelThornberry7, as mentioned, the sample code does not work. Are you loading the data from Excel as a CSV? My guess is that the source of your unwanted behavior stems from there. — Juancheeto, Aug 16 '22 at 17:39

mozway · Answer 1 · 2022-08-16T17:43:42.230

Assuming:

df = pd.DataFrame({
    'year': ['2005', '2006', '2007'], 
    'revenue': [500, '(R) 1000', 2200]})

You can use:

df['revenue'] = (df['revenue'].str.extract(r'(\d+)$', expand=False)
                 .fillna(df['revenue'])
                 .astype(int)
                 )

Output:

   year  revenue
0  2005      500
1  2006     1000
2  2007     2200
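
A note on why the fillna step is there: on a column that mixes numbers and strings, the .str accessor returns NaN for the non-string entries, so the untouched numbers have to be filled back in. A minimal sketch of an alternative, assuming the same sample frame as above (the names extracted and clean are only illustrative), casts everything to str first so that no fillna is needed:

```python
import pandas as pd

df = pd.DataFrame({
    'year': ['2005', '2006', '2007'],
    'revenue': [500, '(R) 1000', 2200]})

# .str methods yield NaN for the non-string entries (500 and 2200)
extracted = df['revenue'].str.extract(r'(\d+)$', expand=False)

# Casting to str first lets the pattern match every row
clean = (df['revenue'].astype(str)
         .str.extract(r'(\d+)$', expand=False)
         .astype(int))
```

Both variants only capture trailing digits, so values with decimals or thousands separators would need a different pattern.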

previous answer

Use pandas.to_numeric:

df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')

To replace with a given value, combine with fillna:

df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce').fillna(1000)
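
Since errors='coerce' silently turns anything it cannot parse into NaN, it can help to look at the affected rows before overwriting the column. A small sketch, assuming the sample frame from the top of this answer (the names converted and flagged are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'year': ['2005', '2006', '2007'],
    'revenue': [500, '(R) 1000', 2200]})

# Convert without touching the original column yet
converted = pd.to_numeric(df['revenue'], errors='coerce')

# Rows that could not be parsed as numbers, here the '(R) 1000' entry
flagged = df[converted.isna()]
print(flagged)
```

If flagged comes back empty on the real data, the (R) is probably not stored in the values at all.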

score 0 · Answer 2 · answered Aug 16 '22 at 17:48

This should remove all letters and parentheses from your strings:

df['revenue'] = df['revenue'].replace('[A-Za-z)(]','',regex=True).astype(int)
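
For what it's worth, the replace call in the question most likely leaves the strings unchanged because, with regex=True, the parentheses in '(R) ' form a capture group rather than literal characters, so the pattern looks for an R followed directly by a space. A narrower sketch that removes only the literal marker, assuming the values really are stored as strings like '(R) 1000' (the name stripped is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'year': ['2005', '2006', '2007'],
    'revenue': [500, '(R) 1000', 2200]})

# Escape the parentheses so they match literally; \s* also drops the space
stripped = df['revenue'].astype(str).str.replace(r'\(R\)\s*', '', regex=True)

# Convert the cleaned strings back to numbers
df['revenue'] = pd.to_numeric(stripped)
```

Unlike the character-class version, this leaves any other letters in place, so to_numeric will raise if something unexpected is still in the column.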

rhug123
