While working on some set operations in Python using numpy and pandas, I came across a strange phenomenon which, I would argue, results in an inconsistency in nan handling.
Let's assume that we have a very simple situation with sets as our points of interest:
import numpy as np
import pandas as pd
a = {1, 2, 3, np.nan}
b = {1, 2, 4, np.nan}
print(a - b)
Out:
{3}
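(As far as I understand, this works because Python's set operations check identity before equality, and both set literals reference the very same np.nan object. A small sketch illustrating that assumption:)

```python
import numpy as np

a = {1, 2, 3, np.nan}
b = {1, 2, 4, np.nan}

# Both literals reference the same np.nan object, so the identity
# shortcut in set lookup matches it even though nan != nan per IEEE 754.
print(np.nan is np.nan)  # True: same object
print(np.nan == np.nan)  # False: nan never compares equal to itself
print(a - b)             # {3}
```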
This is perfectly fine and exactly what we would expect, but let us continue with a somewhat more complicated example involving a pandas series / data frame:
series = pd.Series([1, 2, 3, np.nan, 1, 2, 3])
d = set(series)
print(d)
Out:
{nan, 1.0, 2.0, 3.0}
Once again, perfectly fine. Though, when we call:
print(d - b) # the same applies to a single column of a data frame in place of a series
the result is (quite unexpectedly to me):
{nan, 3.0}
There is still a nan value in the output.
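(My working assumption is that iterating the series yields a fresh numpy.float64 nan scalar rather than the np.nan singleton, so neither the identity check nor the equality check can match the two nans. A sketch probing that:)

```python
import math
import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3, np.nan, 1, 2, 3])
d = set(series)

# Pull out the nan that pandas stored: it is a fresh numpy.float64
# scalar, not the np.nan object used in the literal sets above.
nan_in_d = next(x for x in d if math.isnan(x))
print(nan_in_d is np.nan)  # False: a different object
print(nan_in_d == np.nan)  # False: nan never compares equal either
# With both identity and equality failing, the set difference has no
# way to match the two nans, so one of them survives:
print(d - {1, 2, 4, np.nan})
```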
I do understand that when we create the series variable, all of the input values are cast under the hood to a float64 format, including the nan value.
type(series.iloc[3])
Out:
numpy.float64
Whereas the type of a freely created np.nan is just float. Of course, the np.isnan() function returns True in both cases. I still see this as an inconsistency, because I would assume that all basic Python operations (to which set operations undoubtedly belong) treat nans in the same manner as numbers. Even though the same type conversion that happens to the nans is also applied to the numbers in the sets (in pure Python they are ints, whereas in a pandas series they are floats), set operations still consider those numbers the same entities and remove them adequately. nan is supposed to be (quasi-)numeric as well, and yet it is handled differently. Is this a feature, a bug, or an acknowledged situation which for some reason cannot be resolved?
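For completeness, the behaviour I originally expected can be recovered by normalising the nans before the set operation. This is only a workaround sketch, not a resolution of the inconsistency: either drop the nans entirely, or map each one back to the single np.nan object so the identity shortcut applies again.

```python
import math
import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3, np.nan, 1, 2, 3])
b = {1, 2, 4, np.nan}

# Option 1: exclude nans entirely before building the set.
d_no_nan = set(series.dropna())
print(d_no_nan - b)  # {3.0}

# Option 2: replace every nan with the canonical np.nan object, so
# set operations can match the nans by identity again.
d_canonical = {np.nan if math.isnan(x) else x for x in series}
print(d_canonical - b)  # {3.0}
```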
Python version: 3.6.6. Numpy version: 1.16.2. Pandas version: 0.24.2.