How can I remove the thousand comma separator when converting data frame columns?

Question

Given the following data frame:

State,City,Population,Poverty_Rate,Median_Age, 
VA,XYZ,.,10.5%,42, 
MD,ABC,"12,345",8.9%,., 
NY,.,987,654,.,41, 
...

import pandas as pd
df = pd.read_csv("/path... /sample_data")

df.dtypes returns

State          Object
City           Object
Population     Object
Proverty_Rate  Object
Median_Age     Object

I attempt to convert the data type of appropriate columns to int or float:

df = df.astype({"Population": int, "Proverty_rate": float, "Median_Age": int })

I received

Value Error: invalid literal for int() with base 10: '12,345'

I suspect the comma separator is causing this problem. How can I remove those from my dataset?

Yes, when you read in the csv, use the `thousands` parameter. `pd.read_csv('file.csv', thousands=',')` — Trenton McKinney, May 13 '20 at 23:58

score 3 · Accepted Answer · edited Jun 14 '21 at 04:13

There is an argument in Pandas DataFrame as pd.read_csv(thousands=',') which is set to None by default.

data = """
State   City    Population Poverty_Rate  Median_Age
VA      XYZ     500,00          10.5%         42
MD      ABC     12,345      8.9%          .
NY      .       987,654     .             41"""

from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(data),sep='\s+',thousands=',')

print(df)

  State City  Population Poverty_Rate Median_Age
0    VA  XYZ       50000        10.5%         42
1    MD  ABC       12345         8.9%          .
2    NY    .      987654            .         41

Ideally, what you need to do is replace the string markers and then coerce your string columns into integers/floats.

#using your dict.
int_cols = ({"Population": int, "Poverty_Rate": float, "Median_Age": int })

for col in int_cols.keys():
    df[col] = pd.to_numeric(df[col].astype(str).str.replace('%',''),errors='coerce')

print(df.dtypes)

State            object
City             object
Population        int64
Poverty_Rate    float64
Median_Age      float64
dtype: object


print(df)

  State City  Population  Poverty_Rate  Median_Age
0    VA  XYZ       50000          10.5        42.0
1    MD  ABC       12345           8.9         NaN
2    NY    .      987654           NaN        41.0

Thanks. I should included that the .csv file itself is comma delimited, and large numbers are stored as ..., "1,234", ... — Hank, May 16 '20 at 01:43

score 1 · Answer 2 · answered May 13 '20 at 23:52

1

Could you try the following? First do a str.replace on the column before you cast it to an integer?

import pandas as pd

df = pd.DataFrame([
    {'value': '123,445'},
    {'value': '143,445,788'}
])
df['value'] = df['value'].str.replace(',', '').astype(int)

answered May 13 '20 at 23:52

JQadrad

541
2
16

Please try this. This should get rid of all commas and the non wanted dots including the dot in the `Poverty_Rate` column m=df['Poverty_Rate'].str.contains('%') df.loc[:, df.columns != 'Poverty_Rate']=df.loc[:, df.columns != 'Poverty_Rate'].apply(lambda x: x.str.replace('[\,\.]','')) df.loc[~m, 'Poverty_Rate']=df.loc[~m, 'Poverty_Rate'].str.replace('[\.]','') df – wwnde May 14 '20 at 00:23

How can I remove the thousand comma separator when converting data frame columns?

2 Answers2

Linked

Related