I want to convert a column to integer, but the column contains a missing value. The column converts to float fine, but it can't be converted to integer.

Sample code:

import pandas as pd

d2 = {
    'location': ['NY', 'NY', 'PA', 'NY', 'PA', 'PA', 'NY'],
    'dep_name': ['hr', 'mk', 'fin', 'fin', 'hr', 'fin', 'fin'],
    'Duration_of_Employment': [10, 5, 9, 8, 2, 4, 7],
    'Salary': [50000, 86000, 25000, 73000, 28000, 60000, 40000],
    # Stored as strings; 'NaN' marks the missing value
    'Days_Since_Last_Promotion': ['61', '35', '25', '98', 'NaN', '45', '22'],
}
df2 = pd.DataFrame(data=d2)

df2['xy']  = df2['Days_Since_Last_Promotion'].astype(float)
df2['Months_Since_Last_Promotion'] = df2['xy'] // 30

Now 'Months_Since_Last_Promotion' is of float type, but when I try to convert it to integer I get the following error:

df2['Months_Since_Last_Promotion'] = df2['Months_Since_Last_Promotion'].astype(int)

ValueError: Cannot convert NA to integer

From the error, I figured it's due to the missing value NaN and tried this workaround. But it didn't work, and 'Months_Since_Last_Promotion' still shows as float64.

df2.loc[df2['Months_Since_Last_Promotion'].notnull(), 'Months_Since_Last_Promotion'] = df2.loc[df2['Months_Since_Last_Promotion'].notnull(), 'Months_Since_Last_Promotion'].astype(int)

Note: I can't use fillna to replace the NaN. The goal is to keep the column as integer.

Deduplicator
singularity2047
  • Not sure why you can't use fillna, can you elaborate on that? – Yuca Dec 07 '18 at 21:53
  • Possible duplicate of [Convert Pandas column containing NaNs to dtype `int`](https://stackoverflow.com/questions/21287624/convert-pandas-column-containing-nans-to-dtype-int) – ALollz Dec 07 '18 at 22:00
  • The requirement is that we don't want 0 or -1 or something like that in place of NaN. But I guess in the worst case I will take that path – singularity2047 Dec 07 '18 at 22:15

2 Answers

Numeric columns containing NaN values are stored as floats by default (even if all other values are integers). This is because NaN is itself a floating-point value, and NumPy's integer dtypes have no way to represent it. So if you want to keep the NaN as is, without filling the missing value, casting the column to a plain integer dtype may not be possible (to the best of my knowledge). Here's an excerpt from the documentation:

"While pandas supports storing arrays of integer and boolean type, these types are not capable of storing missing data. Until we can switch to using a native NA type in NumPy, we’ve established some “casting rules”. When a reindexing operation introduces missing data, the Series will be cast according to the rules introduced in the table below."

Please refer to:

https://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data-casting-rules-and-indexing
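
A small sketch of the casting rule quoted above may help. The Series and index labels here are made up for illustration and are not from the question. It also tries the same kind of masked assignment as the workaround in the question, which, as far as I can tell, leaves the dtype unchanged because the integers are written back into float storage.

```python
import pandas as pd

# Hypothetical integer Series, used only to illustrate the casting rule
s = pd.Series([1, 2, 3], index=["a", "b", "c"])
print(s.dtype)

# Reindexing with a label that does not exist introduces a missing value,
# so pandas casts the whole Series to float
reindexed = s.reindex(["a", "b", "c", "d"])
print(reindexed.dtype)

# Writing integers back into only the non-null slots does not change the
# dtype of the Series: the values are stored as floats again
mask = reindexed.notnull()
reindexed.loc[mask] = reindexed.loc[mask].astype(int)
print(reindexed.dtype)
```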

Vishnu Kunchur

Actually, there is a way to do this, using the nullable integer dtype in pandas:
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

In your case:

df2['Months_Since_Last_Promotion'] = pd.array(df2['Months_Since_Last_Promotion'], dtype=pd.Int64Dtype())

However, it is important to note that other software may not recognize this column as an integer column. I think this has something to do with NaN being a float in Python.
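
For reference, here is a minimal sketch of the same idea applied to the question's data, assuming a pandas version that supports the nullable `Int64` dtype (0.24 or later). It uses `astype('Int64')` instead of `pd.array`, and `pd.to_numeric` instead of `astype(float)` to parse the strings; both are substitutions on my part, and the variable names are only illustrative.

```python
import pandas as pd

# The question's column: numbers stored as strings, with 'NaN' for the gap
days = pd.Series(['61', '35', '25', '98', 'NaN', '45', '22'])

# Parse to float first; the missing entry becomes a float NaN
days_float = pd.to_numeric(days, errors='coerce')

# Floor division keeps the float dtype and propagates the NaN
months_float = days_float // 30

# Nullable integer dtype (capital "I"): whole numbers become integers,
# and the NaN becomes pd.NA instead of forcing the column to float
months_int = months_float.astype('Int64')

print(months_int)
print(months_int.dtype)
```

Note the capital "I" in `'Int64'`: the lowercase `'int64'` is the plain NumPy dtype, which still raises on missing values.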