In a relatively large dataset, the modeling algorithm we use (lightgbm) has detected an unknown data type caused by an NA (not NaN) in the index (not the values) of one of the (at least originally) numerical columns, raising this error message:
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in the following fields: <col_name>
The missing index value can only be found using value_counts - other methods miss it. How can it be replaced with an innocent string when it cannot be found (yet still blocks the modeling algorithm)?
Downcasting the column to integer or float did not get rid of the missing value in the index; instead it converted the column to pandas extension types that allow for missings, even though the data contain no missing values (the NA index value has a zero count).
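The downcasting behavior can be reproduced on a small nullable-typed series (a sketch; `s` is a stand-in for the real column):

```python
import pandas as pd

# A nullable (extension-typed) column, as produced by some IO/merge paths
s = pd.Series([1048, 1040, 1041], dtype="UInt32")

# downcast="float" keeps the result in the nullable extension family
out = pd.to_numeric(s, errors="coerce", downcast="float")
print(out.dtype)  # Float32 on recent pandas -- still an extension dtype
```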
More info
The data type of the column - which holds integer values - is UInt32Dtype, most likely because of this missing value (pandas.NA rather than numpy.nan) in the index:
test_df[col_name].value_counts(dropna=False).index
Index([1048, 1040, 1041, 1049, 1047, 1046, 1050, 1044,
1043, 1042, 1051, 1045, 1052, <NA>], dtype='object')
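A minimal sketch of how an `<NA>` bucket surfaces in a value_counts index (here with one genuinely missing value, since the zero-count case above depends on the pandas version):

```python
import pandas as pd

s = pd.Series([1048, 1040, 1040, pd.NA], dtype="UInt32")
idx = s.value_counts(dropna=False).index
print(idx)               # includes <NA>
print(idx.isna().sum())  # 1 -- isna() on this index does find it
```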
As you can see, the NA is only in the index, with no values (zero count) associated with it:
test_df[col_name].value_counts(dropna=False)
1048 123099
1040 115015
1041 114987
1049 114474
1047 114124
1046 112952
1050 112453
1044 111684
1043 110286
1042 108400
1051 106731
1045 102131
1052 42033
NaN 0
Name: <col_name>, dtype: Int64
Notice the pandas-only extension type Int64 (not numpy's int64), which can accommodate missings (including index-only missings like in this case), as shown above, while dtypes reports yet another (32-bit) extension type, UInt32Dtype():
test_df[col_name].dtypes
UInt32Dtype()
What finds the NA:
test_df[col_name].value_counts(dropna=False).index.isna().sum()
1
What misses it (the list is probably not exhaustive) - note that all of these checks inspect the DataFrame's row index, which is a different object from the value_counts result index:
test_df.index.isna().sum()
0
test_df[col_name].index.isna().sum()
0
test_df.index.isnull().sum()
0
test_df[col_name].index.isnull().sum()
0
(test_df.index == np.nan).any()
False
np.sum(test_df[col_name].index == pd.NA)
0
np.sum(test_df[col_name].index == "NA")
0
(test_df.index.fillna('No label') == test_df.index).all()
True
(test_df[col_name].index.fillna('No label') == test_df[col_name].index).all()
True
(pd.Series(test_df.index).replace(np.nan, 'No label') == test_df.index).all()
True
Trying to downcast it to a float fails:
test_df[col_name] = pd.to_numeric(test_df[col_name], errors='coerce', downcast="float")
.. because it gets converted to yet another pandas-only extension type, Float32 (rather than the expected numpy float64):
test_df[col_name].dtype
Float32Dtype()
.. and the NA still remains in the index:
test_df[col_name].value_counts(dropna=False).index
Index([1048.0, 1040.0, 1041.0, 1049.0, 1047.0, 1046.0, 1050.0, 1044.0,
1043.0, 1042.0, 1051.0, 1045.0, 1052.0, <NA>], dtype='object')
As a preventive measure I will go over all instances of set_index in the pipeline to deduplicate and drop missing values from the index columns (we set the index on various key columns to improve join performance when appending new parts to this large dataset).
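A sketch of that preventive step, assuming a hypothetical key column named "key":

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "b", None], "val": [1, 2, 3, 4]})

# Drop rows with a missing key and deduplicate before set_index,
# so NA never enters an index used for joins
clean = df.dropna(subset=["key"]).drop_duplicates(subset=["key"], keep="first")
indexed = clean.set_index("key")
print(indexed.index.isna().sum())  # 0
```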
Related: