0

A weird thing - i have a dataframe, lets call it ID. While importing xlsx source file, I do .astype({"ID_1": str, "ID_2": str})

Yet, for example instead of 10300 I get 10300.0. Moreover, then I get string "nan" as well.

In order to fix both issues I did this rubbish:

my_df['ID_1'].replace(['None', 'nan'], np.nan, inplace=True)
my_df[my_df['ID_1'].notnull()].ID_1.astype(float).astype(int).astype(str)

As a result I still have these 10300.0 Any thoughts how to fix these? I could keep it as float while importing data, instead of .astype, but it does not change anything.

funkurlif3
  • 21
  • 2
  • Pandas has int-types that allow for missing values: Try `.astype({"ID_1": "Int64", "ID_2": "Int64"})` for example. (Have a look [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html).) – Timus Jul 06 '22 at 12:33

1 Answers1

0

The issue is that int cannot represent NaN value, so pandas converts the column to float.

It is a common pitfall, as the presence of additional rows with missing data can change the result of a given row.

You can however pick a specific pandas type to indicate that it is an integer with missing values, see Convert Pandas column containing NaNs to dtype `int`, especially the link https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

Coding thermodynamist
  • 1,340
  • 1
  • 10
  • 18