
Sorry, I cannot share the data. I tried to create test data, but it does not produce the same error or the same missing values as described below.

Edit: I added more info about pd.NA at the bottom.

I am loading the data with this code:

df = pd.read_csv("C:/data.csv")

When loading the data I get this warning:

C:\Users\User1\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (162,247,274,292,304,316,321,335,345,347,357,379,389,390,393,395,400,401,420,424,447,462,465,467,478,481,534,536,538,570,616,632,653,666,675,691,707,754,758,762,766,770,774,778,782,784,785,786,788,789,790,792,793,794,796,797,798,800,801,802,804,805,806,808,809,810,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,867,868,871,872,875,876,880,1367,1368,1370,1371,1373,1374,1376,1377,1379,1380,1382,1383,1385,1386,1388,1389,1391,1392,1394,1395,1397,1398,1400,1401,1403,1404,1406,1407,1409,1410,1412,1413,1415,1416,1418,1419,1421,1422,1424,1425,2681) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

As I understood from this question, this warning is not a problem and I can ignore it.
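For completeness, here is a minimal sketch of the two usual ways to silence that warning (the in-memory CSV below just stands in for my real `C:/data.csv`, and the column names are made up):

```python
import io

import pandas as pd

# Toy CSV standing in for C:/data.csv (illustrative only).
csv = io.StringIO("var1,var2\n1,0\n,1\n0,\n")

# Option 1: read the whole file before inferring dtypes, which avoids the
# chunk-by-chunk inference that triggers DtypeWarning on large files.
df = pd.read_csv(csv, low_memory=False)

# Option 2 (alternative): pin the dtype of the offending columns explicitly,
# e.g. pd.read_csv("C:/data.csv", dtype={"var1": "float64"})
print(df.dtypes)
```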

Then I run this code, adapted from here:

# from: https://stackoverflow.com/questions/60101845/compare-multiple-pandas-columns-1st-and-2nd-after-3rd-and-4rth-after-etc-wit
# from: https://stackoverflow.com/questions/27474921/compare-two-columns-using-pandas?answertab=oldest#tab-top
# from: https://stackoverflow.com/questions/60099141/negation-in-np-select-condition

import pandas as pd
import numpy as np

col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]

# Boolean masks: True where the value is missing
s1 = df[col1].isnull().to_numpy()
s2 = df[col2].isnull().to_numpy()

# both present / both missing / only first missing / only second missing
conditions = [~s1 & ~s2, s1 & s2, ~s1 & s2, s1 & ~s2]
choices = ["Both values", np.nan, df[col1], df[col2]]

df = pd.concat([df, pd.DataFrame(np.select(conditions, choices), columns=colR, index=df.index)], axis=1)

After running the code above, the newly created columns contain nan, but the columns loaded from the csv file contain NaN.

After running df['var1'].value_counts(dropna=False), I am getting output:

NaN    3453
0.0    3002
1.0     314
Name: var1, dtype: int64

After running df['Result1'].value_counts(dropna=False), I am getting output:

0.0            3655
nan            2665
1.0             407
Both values      42
Name: Result1, dtype: int64

Notice that var1 contains NaN values but Result1 contains nan values.
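I could not reproduce the warning with test data, but the nan-vs-NaN difference itself seems reproducible in isolation. Here is a minimal sketch (toy frame, made-up values) of what appears to happen: `np.select` promotes all choices to one common dtype, and a mix of strings and floats promotes to a string dtype, so `np.nan` is stringified to the literal text `'nan'`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative only).
df = pd.DataFrame({"var1": [1.0, np.nan, np.nan],
                   "var2": [np.nan, 2.0, np.nan]})

s1 = df["var1"].isnull().to_numpy()
s2 = df["var2"].isnull().to_numpy()
conditions = [~s1 & ~s2, s1 & s2, ~s1 & s2, s1 & ~s2]
choices = ["Both values", np.nan, df["var1"], df["var2"]]

out = np.select(conditions, choices)
# str + float promote to a fixed-width string dtype, so every element,
# including np.nan, is converted to text.
print(out)        # ['1.0' '2.0' 'nan']
print(out.dtype)  # a '<U...' string dtype, not float
```

This would explain why `value_counts` shows nan without being treated as missing: it is just a three-character string.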

When I run df['var1'].value_counts(dropna=False).loc[[np.nan]] I am getting output:

NaN    3453
Name: weeklyivr_q1, dtype: int64

When I run df['Result1'].value_counts(dropna=False).loc[[np.nan]] I get an error (the variable names in the error are different, but the key point is that there are no missing values):

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-52-0daeac75fdb4> in <module>
     27 #combined_IVR["weeklyivr_q1"].value_counts(dropna=False)
     28 #combined_IVR["my_weekly_ivr_1"].value_counts(dropna=False).loc[["Both values"]]
---> 29 combined_IVR["my_weekly_ivr_1"].value_counts(dropna=False).loc[[np.nan]]
     30 #combined_IVR["weeklyivr_q1"].value_counts(dropna=False).loc[[np.nan]]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1764 
   1765             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1766             return self._getitem_axis(maybe_callable, axis=axis)
   1767 
   1768     def _is_scalar_access(self, key: Tuple):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1950                     raise ValueError("Cannot index with multidimensional key")
   1951 
-> 1952                 return self._getitem_iterable(key, axis=axis)
   1953 
   1954             # nested tuple slicing

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis)
   1591         else:
   1592             # A collection of keys
-> 1593             keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
   1594             return self.obj._reindex_with_indexers(
   1595                 {axis: [keyarr, indexer]}, copy=True, allow_dups=True

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1549 
   1550         self._validate_read_indexer(
-> 1551             keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
   1552         )
   1553         return keyarr, indexer

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1636             if missing == len(indexer):
   1637                 axis_name = self.obj._get_axis_name(axis)
-> 1638                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1639 
   1640             # We (temporarily) allow for some missing keys with .loc, except in

KeyError: "None of [Float64Index([nan], dtype='float64')] are in the [index]"

When I run df['Result1'].value_counts(dropna=False).loc[['nan']] I get:

nan    2665
Name: my_weekly_ivr_1, dtype: int64

So the nan in the 'Result1' column is the string 'nan'.

If I replace choices = ["Both values", np.nan, df[col1], df[col2]] with choices = ["Both values", pd.NA, df[col1], df[col2]] and then run:

df['Result1'].value_counts(dropna=False).loc[[np.nan]]

I am getting output:

NaN    2665
Name: Result1, dtype: int64

So in this case np.nan produces a string, while pd.NA produces a real missing value.
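The pd.NA behavior can also be sketched on the same toy frame (made-up values): pd.NA is an object scalar, so the common dtype that `np.select` settles on becomes object, and the missing marker survives instead of being stringified:

```python
import numpy as np
import pandas as pd

# Same toy frame as before (illustrative only).
df = pd.DataFrame({"var1": [1.0, np.nan, np.nan],
                   "var2": [np.nan, 2.0, np.nan]})

s1 = df["var1"].isnull().to_numpy()
s2 = df["var2"].isnull().to_numpy()
conditions = [~s1 & ~s2, s1 & s2, ~s1 & s2, s1 & ~s2]

# pd.NA forces promotion to object dtype, so it stays a missing value.
choices = ["Both values", pd.NA, df["var1"], df["var2"]]
out = pd.Series(np.select(conditions, choices), name="Result1")
print(out.isna().sum())  # the missing value is recognized as missing
```

The follow-up `df = df.fillna(np.nan)` mentioned in the comments below should then convert the surviving pd.NA back to np.nan, if a float-style NaN is needed downstream.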


Question:

Why am I getting the string nan in the 'Result1' column when using np.nan? What is the reason, and how can I fix it?

vasili111
  • I am trying to solve this problem by using `pd.NA` in `choices` variable instead of `np.nan` and after all code running `df = df.fillna(np.nan)` which seems work, but I am not sure how safe is this way. – vasili111 Mar 06 '20 at 21:01
  • Hello, what about replacing `np.nan` with `np.NaN` in choices ? Therefore, the newly created columns will also have NaN representation. – Raphaele Adjerad Apr 08 '20 at 06:30

0 Answers