In a relatively large dataset, the modeling algorithm we use (lightgbm) has detected an unknown data type caused by an NA (not NaN) in the index (not the values) of one of the (at least originally) numerical columns, raising this error message:
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in the following fields: <col_name>
The missing index value can only be found using value_counts - other methods miss it. How can it be replaced with an innocent string when it cannot be found (yet still blocks the modeling algorithm)?
Downcasting the column to integer or float did not get rid of the missing value in the index; instead it converted the column to pandas extension types that allow for missings, even though the data contain no missing values (the NA index value has a zero count).
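The downcasting behavior can be reproduced on a small nullable-typed series (a sketch; `s` is a stand-in for the real column):

```python
import pandas as pd

# A nullable (extension-typed) column, as produced by some IO/merge paths
s = pd.Series([1048, 1040, 1041], dtype="UInt32")

# downcast="float" keeps the result in the nullable extension family
out = pd.to_numeric(s, errors="coerce", downcast="float")
print(out.dtype)  # Float32 on recent pandas -- still an extension dtype
```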
More info
The data type of the column - which holds integer values - is UInt32Dtype, most likely because of this missing value (pandas.NA rather than numpy.nan) in the index:
test_df[col_name].value_counts(dropna=False).index
Index([1048, 1040, 1041, 1049, 1047, 1046, 1050, 1044,
1043, 1042, 1051, 1045, 1052, <NA>], dtype='object')
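A minimal sketch of how an `<NA>` bucket surfaces in a value_counts index (here with one genuinely missing value, since the zero-count case above depends on the pandas version):

```python
import pandas as pd

s = pd.Series([1048, 1040, 1040, pd.NA], dtype="UInt32")
idx = s.value_counts(dropna=False).index
print(idx)               # includes <NA>
print(idx.isna().sum())  # 1 -- isna() on this index does find it
```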
As you can see, the NA is only in the index, with no values (zero count) associated with it:
test_df[col_name].value_counts(dropna=False)
1048 123099
1040 115015
1041 114987
1049 114474
1047 114124
1046 112952
1050 112453
1044 111684
1043 110286
1042 108400
1051 106731
1045 102131
1052 42033
NaN 0
Name: <col_name>, dtype: Int64
Notice the pandas-only extension type Int64 (not numpy's int64), which can accommodate missings (including index-only missings like in this case), as shown above, while dtypes reports yet another (32-bit) extension type, UInt32Dtype():
test_df[col_name].dtypes
UInt32Dtype()
What finds the NA:
test_df[col_name].value_counts(dropna=False).index.isna().sum()
1
What misses it (the list is probably not exhaustive) - note that all of these checks inspect the DataFrame's row index, which is a different object from the value_counts result index:
test_df.index.isna().sum()
0
test_df[col_name].index.isna().sum()
0
test_df.index.isnull().sum()
0
test_df[col_name].index.isnull().sum()
0
(test_df.index == np.nan).any()
False
np.sum(test_df[col_name].index == pd.NA)
0
np.sum(test_df[col_name].index == "NA")
0
(test_df.index.fillna('No label') == test_df.index).all()
True
(test_df[col_name].index.fillna('No label') == test_df[col_name].index).all()
True
(pd.Series(test_df.index).replace(np.nan, 'No label') == test_df.index).all()
True
Trying to downcast it to a float fails:
test_df[col_name] = pd.to_numeric(test_df[col_name], errors='coerce', downcast="float")
.. because it gets converted to yet another pandas-only extension type, Float32 (rather than the expected numpy float64):
test_df[col_name].dtype
Float32Dtype()
.. and the NA still remains in the index:
test_df[col_name].value_counts(dropna=False).index
Index([1048.0, 1040.0, 1041.0, 1049.0, 1047.0, 1046.0, 1050.0, 1044.0,
1043.0, 1042.0, 1051.0, 1045.0, 1052.0, <NA>], dtype='object')
As a preventive measure I will go over all instances of set_index in the pipeline to deduplicate and drop missing values from the index columns (we set the index on various key columns to improve join performance when appending new parts to this large dataset).
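A sketch of that preventive step, assuming a hypothetical key column named "key":

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "b", None], "val": [1, 2, 3, 4]})

# Drop rows with a missing key and deduplicate before set_index,
# so NA never enters an index used for joins
clean = df.dropna(subset=["key"]).drop_duplicates(subset=["key"], keep="first")
indexed = clean.set_index("key")
print(indexed.index.isna().sum())  # 0
```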
Related: