
While working on some set operations in Python with numpy and pandas, I came across a strange phenomenon that, I would argue, amounts to an inconsistency in how nans are handled.

Let's start with a very simple case involving plain sets:

import numpy as np
import pandas as pd

a = {1, 2, 3, np.nan}
b = {1, 2, 4, np.nan}

print(a - b)

Out:

{3}

This is perfectly fine and exactly what we would expect. Now let us continue with a slightly more complicated example that involves a pandas series / data frame:

series = pd.Series([1, 2, 3, np.nan, 1, 2, 3])
d = set(series)
print(d)

Out:

{nan, 1.0, 2.0, 3.0}

Once again, perfectly fine. However, when we call:

print(d - b)  # the same applies to a single column of a data frame in place of a series

the result is (quite unexpectedly to me):

{nan, 3.0}

There is still a nan value in the output.

I do understand that when we create the series variable, all of the input values are cast under the hood to float64, including the nan value:

type(series.iloc[3])

Out:

numpy.float64

The type of a directly referenced np.nan, on the other hand, is just float. Of course, np.isnan() returns True in both cases. I still see this as an inconsistency, because I would assume that all basic Python operations (to which set operations undoubtedly belong) treat nans the same way as numbers. The numbers in the sets undergo the same type conversion as the nans (they are ints in pure Python and floats in the pandas series), yet set operations still treat them as the same entities and remove them correctly. nan is supposed to be (quasi-)numeric too, and yet it is handled differently. Is this a feature, a bug, or an acknowledged situation that for some reason cannot be resolved?
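To isolate the behaviour from pandas, here is a minimal sketch in pure Python. It assumes that the nan values coming out of the series are simply different objects from np.nan, which is what the results above suggest but which I have not verified in the pandas source. The variable names are made up for illustration.

```python
import math

# Two separately created NaN objects (stand-ins for np.nan and a NaN
# obtained from somewhere else, e.g. from iterating over a series).
nan_a = float("nan")
nan_b = float("nan")

# Same object on both sides: the element is removed.
same_object = {nan_a} - {nan_a}

# Different objects: NaN never compares equal to anything, including
# another NaN, so the element survives the difference.
different_objects = {nan_a} - {nan_b}

# Ordinary numbers of different types compare equal and hash alike,
# so 1.0 is removed by the int 1.
int_vs_float = {1.0} - {1}

print(same_object)        # empty set
print(different_objects)  # one NaN left
print(int_vs_float)       # empty set
```

So the type change from int to float does not matter for ordinary numbers because they still compare equal, whereas two nans only "match" in a set when they are literally the same object.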

Python version: 3.6.6. Numpy version: 1.16.2. Pandas version: 0.24.2.

Garrus990
  • Somewhat related: https://stackoverflow.com/questions/3942303/how-does-a-python-set-check-if-two-objects-are-equal-what-methods-does-an-o My Google-fu could not find how Python implements the set difference exactly. Testing (in Python 2.7) shows it is neither `==` (because `{np.nan} - {np.nan}` returns an empty set) nor `is` (ditto for `{1.0} - {1}`). – Leporello May 24 '19 at 10:06
  • Thanks @Leporello, this is an important factor for this question (determination of equality of objects), but there is also another side to this inquiry: whether the observed phenomenon is desired (if yes - in what situations is it beneficial) or rather erroneous. Still though, very useful observation! – Garrus990 May 24 '19 at 11:06
  • On equality of nan's https://stackoverflow.com/a/56193097/901925 – hpaulj May 24 '19 at 11:53
  • `float('nan')` and `np.nan` are different objects (don't match with `is`), even though both are `float` and satisfy `np.isnan`. And being `nan` they aren't `==`. So a `set` will contain both. More generally, using `set` with floats is unreliable. – hpaulj May 24 '19 at 16:42
  • Well, that was actually my point. Thanks to your comments I understand what's behind the phenomenon I am asking about, but one thing remains: the answer to the question whether that is actually consistent with the rest of the pythonic world, as in my view - it isn't. I would expect the same behavior from `nan`s as from `float`s (as `nan`s are quasi-numeric) and currently I do not see any consistency in that. – Garrus990 May 29 '19 at 12:17
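Following on from the comments above, one possible workaround is to map every nan onto a single shared object before building the sets, so that the identity check succeeds. This is only a sketch: `NAN` and `nan_normalized_set` are hypothetical names, not part of Python, numpy, or pandas.

```python
import math

# One shared NaN object used for every NaN we encounter (hypothetical constant).
NAN = float("nan")


def nan_normalized_set(values):
    """Build a set in which every float NaN is replaced by the shared NAN object."""
    return {
        NAN if isinstance(v, float) and math.isnan(v) else v
        for v in values
    }


# Each input gets its own, distinct NaN object, as in the question.
d = nan_normalized_set([1.0, 2.0, 3.0, float("nan"), 1.0, 2.0, 3.0])
b = nan_normalized_set([1, 2, 4, float("nan")])

print(d - b)  # {3.0}
```

The `isinstance(v, float)` guard keeps `math.isnan` away from non-numeric elements such as strings; numpy's float64 scalars subclass Python's float, so they should pass the check as well.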
