5
df = pd.read_csv("data.csv", encoding = "ISO-8859-1")

Now, I have a column where I have values are as below:

Sample data for reference:

enter image description here

Now, i want to convert the column a to a numeric format using below code:

df[['A']] = df[['A']].astype(int)

and it gives me an error. The problem is I have all three (nan, hyphen and comma) all in one column and need to address them together. Is there any better way to convert these without replacing (nan to -1) and things like that?

Cœur
  • 37,241
  • 25
  • 195
  • 267
  • Please consider posting data as text, not as image. This will increase your chance of getting high quality answer to your question. – Sergey Bushmanov Jan 15 '19 at 07:50
  • you cannot covert a number with comma into a numeric format, you have to relieve comma. Also, if there are `NaN` (also assuming `-` to be converted to `NaN`) then overall type can be made to `float64` – meW Jan 15 '19 at 07:50

2 Answers2

4

Use parameters thousands and na_values, but converting to integers is not possible with missing values, because now at least one NaN value cast column to floats, see this. So possible solution is replace them to int, e.g. -1 and then cast to integer:

Notice - In new version of pandas (0.24.0, coming soon) pandas has gained the ability to hold integer dtypes with missing values, Nullable Integer Data Type.

import pandas as pd

temp=u'''A
2254
"1,234"
"3,385"
nan
-
-
nan'''
#after testing replace 'pd.compat.StringIO(temp)' to 'data.csv'
df = pd.read_csv(pd.compat.StringIO(temp), 
                 encoding = "ISO-8859-1", 
                 thousands=',', 
                 na_values='-')

print (df)
        A
0  2254.0
1  1234.0
2  3385.0
3     NaN
4     NaN
5     NaN
6     NaN

df['A'] = df['A'].fillna(-1).astype(int)
print (df)
      A
0  2254
1  1234
2  3385
3    -1
4    -1
5    -1
6    -1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

Maybe should do pd.to_numeric with errors='coerce' and str.replace:

df['A'] = pd.to_numeric(df['A'].str.replace(',',''),errors='coerce')

And now:

print(df['A'])

Is:

0    2254.0
1    1234.0
2    3385.0
3       NaN
4       NaN
5       NaN
6       NaN
Name: A, dtype: float64
U13-Forward
  • 69,221
  • 14
  • 89
  • 114