Sorry, I cannot share the data. I tried to make test data, but it does not give the same error or the different missing values described below.
I have added more info at the bottom about pd.NA.
I am loading data with code:
df = pd.read_csv("C:/data.csv")
When loading data I am getting this warning:
C:\Users\User1\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (162,247,274,292,304,316,321,335,345,347,357,379,389,390,393,395,400,401,420,424,447,462,465,467,478,481,534,536,538,570,616,632,653,666,675,691,707,754,758,762,766,770,774,778,782,784,785,786,788,789,790,792,793,794,796,797,798,800,801,802,804,805,806,808,809,810,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,867,868,871,872,875,876,880,1367,1368,1370,1371,1373,1374,1376,1377,1379,1380,1382,1383,1385,1386,1388,1389,1391,1392,1394,1395,1397,1398,1400,1401,1403,1404,1406,1407,1409,1410,1412,1413,1415,1416,1418,1419,1421,1422,1424,1425,2681) have mixed types.Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
As I understand from this question, this warning is not a problem and I can ignore it.
After that, I run this code, taken from here:
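For reference, the warning text itself names two options: pass dtype on import or set low_memory=False. A minimal sketch of both follows, using a made-up in-memory CSV in place of C:/data.csv; the column names and values are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the real file
csv_text = "var1,var2\n0,a\n1,\n,b\n"

# Option 1: read the file in a single pass so each column's type is inferred once
df_full = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: declare the dtype of a known mixed column up front
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"var2": "object"})

print(df_full.dtypes)
print(df_typed.dtypes)
```

Either option should only affect how column types are inferred on load; whether it changes anything further down depends on the real data.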
# from: https://stackoverflow.com/questions/60101845/compare-multiple-pandas-columns-1st-and-2nd-after-3rd-and-4rth-after-etc-wit
# from: https://stackoverflow.com/questions/27474921/compare-two-columns-using-pandas?answertab=oldest#tab-top
# from: https://stackoverflow.com/questions/60099141/negation-in-np-select-condition
import pandas as pd
import numpy as np
col1 = ["var1", "var3", "var5"]
col2 = ["var2", "var4", "var6"]
colR = ["Result1", "Result2", "Result3"]
s1 = df[col1].isnull().to_numpy()
s2 = df[col2].isnull().to_numpy()
conditions = [~s1 & ~s2, s1 & s2, ~s1 & s2, s1 & ~s2]
choices = ["Both values", np.nan, df[col1], df[col2]]
df = pd.concat([df, pd.DataFrame(np.select(conditions, choices), columns=colR, index=df.index)], axis=1)
The columns newly created by the code above contain nan, but the columns that are loaded from the csv file contain NaN.
After running df['var1'].value_counts(dropna=False), I get this output:
NaN 3453
0.0 3002
1.0 314
Name: var1, dtype: int64
After running df['Result1'].value_counts(dropna=False), I get this output:
0.0 3655
nan 2665
1.0 407
Both values 42
Name: Result1, dtype: int64
Notice that var1 contains NaN values but Result1 contains nan values.
When I run df['var1'].value_counts(dropna=False).loc[[np.nan]], I get this output:
NaN 3453
Name: weeklyivr_q1, dtype: int64
When I run df['Result1'].value_counts(dropna=False).loc[[np.nan]], I get an error (the variable names in the error are different, but the key point is that no missing values are found):
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-52-0daeac75fdb4> in <module>
27 #combined_IVR["weeklyivr_q1"].value_counts(dropna=False)
28 #combined_IVR["my_weekly_ivr_1"].value_counts(dropna=False).loc[["Both values"]]
---> 29 combined_IVR["my_weekly_ivr_1"].value_counts(dropna=False).loc[[np.nan]]
30 #combined_IVR["weeklyivr_q1"].value_counts(dropna=False).loc[[np.nan]]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1764
1765 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1766 return self._getitem_axis(maybe_callable, axis=axis)
1767
1768 def _is_scalar_access(self, key: Tuple):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1950 raise ValueError("Cannot index with multidimensional key")
1951
-> 1952 return self._getitem_iterable(key, axis=axis)
1953
1954 # nested tuple slicing
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis)
1591 else:
1592 # A collection of keys
-> 1593 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1594 return self.obj._reindex_with_indexers(
1595 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1549
1550 self._validate_read_indexer(
-> 1551 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1552 )
1553 return keyarr, indexer
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1636 if missing == len(indexer):
1637 axis_name = self.obj._get_axis_name(axis)
-> 1638 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1639
1640 # We (temporarily) allow for some missing keys with .loc, except in
KeyError: "None of [Float64Index([nan], dtype='float64')] are in the [index]"
When I run df['Result1'].value_counts(dropna=False).loc[['nan']], I get:
nan 2665
Name: my_weekly_ivr_1, dtype: int64
So nan in the 'Result1' column is a string.
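A minimal sketch of how this could be checked directly, using toy columns rather than the real data (the values below are made up): map each cell to its Python type, and compare isna() against an equality test for the text "nan".

```python
import numpy as np
import pandas as pd

# Toy columns with made-up values, not the real data
float_col = pd.Series([np.nan, 0.0, 1.0], name="var1")
string_col = pd.Series(["nan", "0.0", "Both values"], name="Result1")

# Python type of every cell in each column
print(float_col.map(type).value_counts())
print(string_col.map(type).value_counts())

# isna() flags real missing values only, not the text "nan"
n_missing_float = int(float_col.isna().sum())
n_missing_string = int(string_col.isna().sum())
n_text_nan = int((string_col == "nan").sum())
print(n_missing_float, n_missing_string, n_text_nan)
```

If a column reports only str cells and isna() finds nothing while the equality test does, the nan entries are text and not missing values.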
If I replace
choices = ["Both values", np.nan, df[col1], df[col2]]
with
choices = ["Both values", pd.NA, df[col1], df[col2]]
and then run
df['Result1'].value_counts(dropna=False).loc[[np.nan]]
I get this output:
NaN 2665
Name: Result1, dtype: int64
So in this case np.nan produces a string and pd.NA produces a missing value.
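A sketch of how the dtype returned by np.select could be inspected for the two variants. The arrays a and b are hypothetical stand-ins for df[col1] and df[col2], so this may not behave exactly like the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df[col1] and df[col2] (one column each)
a = np.array([[0.0], [np.nan], [1.0], [np.nan]])
b = np.array([[1.0], [np.nan], [np.nan], [0.0]])

s1 = np.isnan(a)
s2 = np.isnan(b)
conditions = [~s1 & ~s2, s1 & s2, ~s1 & s2, s1 & ~s2]

# Variant 1: np.nan as the "both missing" choice
out_nan = np.select(conditions, ["Both values", np.nan, a, b])

# Variant 2: pd.NA as the "both missing" choice
out_na = np.select(conditions, ["Both values", pd.NA, a, b])

# Compare the dtype of the result and the cell where both inputs are missing
print(out_nan.dtype, repr(out_nan[1, 0]))
print(out_na.dtype, repr(out_na[1, 0]))
```

np.select returns a single array with one dtype for all choices, so comparing the two dtypes shows whether the missing marker was kept as an object or converted to text.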
Question:
Why am I getting nan in the 'Result1' column when using np.nan? What could be the reason, and how can I fix this?