
I wonder why Python's pandas / NumPy do not implement 3-valued logic (so-called Łukasiewicz's logic) with true, false and NA (as R does, for instance). I've read (https://www.oreilly.com/learning/handling-missing-data) that this is to some extent because pandas uses many more basic data types than R, for example. However, it is not entirely clear to me why this makes the weird behaviour of logical operations with missing values unavoidable.

Example.

import numpy as np
np.nan and False   # so far so good, we have False
np.nan or False    # again, good, we have nan
False and np.nan   # False, good
False or np.nan    # gives nan, so again, it is correct
np.nan and True    # weird, this gives True, while it should give nan
True and np.nan    # nan, so it is correct, but switching the order should not affect the result
np.nan or True     # gives nan, which is not correct, it should be True
True or np.nan     # True, so it is correct, but again switching the arguments changes the result

So the example shows that something very weird happens in logical operations that combine np.nan and True. What is going on here?

EDIT. Thanks for the comments; now I see that np.nan is considered a "truthy" value. Can anybody explain what exactly this means and what the rationale behind this approach is?

sztal
  • Pandas 2.0 has a lot of changes, including how nulls are handled for non-float types. – Arya McCarthy May 11 '17 at 21:28
  • @aryamccarthy the above won't change with `pandas` 2.0, though. This is basic – juanpa.arrivillaga May 11 '17 at 21:34
  • For the record, very few languages make a distinction between true, false and some third "NA" value. Typically, either strong typing means only special constants have boolean meaning, or if many objects have boolean meaning, they all ultimately get treated as truthy or falsy. R having an NA value is unusual; general purpose programming languages almost never have such a value (you can write your own logic to simulate it, but ultimately the language only supports truthy or falsyness). – ShadowRanger May 11 '17 at 21:39
  • Yes, I understand that logical operations in R are quite special in this regard. However, both pandas and numpy are designed to solve similar problems as R, so I wonder why the 3-valued logic has not been built into these two modules? Is it due to some technical constraints or is it a, somehow rational, design decision of the authors? – sztal May 11 '17 at 21:45
  • @sztal note, you aren't using `pandas` in the above code. All of that is pure python, except you are using an attribute of the numpy module, `np.nan`, but that is the same as `float('nan')`, which is just vanilla Python, so you aren't really even using numpy. – juanpa.arrivillaga May 11 '17 at 21:52
  • Ok, I see now. I thought that `nan` is specifically defined as a part of `numpy` (and then used by `pandas`). Thanks, that makes it more understandable to me. – sztal May 11 '17 at 21:56
  • pandas 1.0.0 was released in Jan. 2020 and it uses 3-valued-logic with the new pd.NA values https://pandas.pydata.org/pandas-docs/version/1.0.0/user_guide/missing_data.html#logical-operations – jeffhale Feb 04 '20 at 21:04
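
The three-valued behaviour discussed in these comments can be simulated in plain Python, as one comment suggests. What follows is only a minimal sketch, not how pandas or R implement it: `NA` (here simply `None`) and the helper names `k_and`, `k_or` and `k_not` are hypothetical, and the inputs are assumed to be exactly `True`, `False` or `NA`.

```python
NA = None  # assumption: None stands in for a missing value in this sketch


def k_and(a, b):
    """Three-valued AND: False wins, otherwise NA is contagious."""
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return True


def k_or(a, b):
    """Three-valued OR: True wins, otherwise NA is contagious."""
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return False


def k_not(a):
    """Three-valued NOT: the negation of an unknown value is unknown."""
    return NA if a is NA else (not a)


# Unlike `np.nan or True`, the result does not depend on argument order.
print(k_or(NA, True), k_or(True, NA))      # True True
print(k_and(NA, False), k_and(False, NA))  # False False
print(k_and(NA, True), k_or(NA, False))    # None None
```

The point of the sketch is that the outcome is decided by the known operand whenever that operand alone settles the result, which is exactly what the built-in `and` / `or` do not do.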

2 Answers


This is NumPy behaviour that is, at least partially, inherited from Python:

In [11]: bool(float('nan'))
Out[11]: True

In [12]: bool(np.nan)
Out[12]: True

(NaN is "truthy".)
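
As a rough illustration of why this happens: a Python float is falsy only when it equals zero, and NaN compares unequal to everything, zero included. The snippet below is a small sketch using only the standard library; the variable name `nan` is just for illustration.

```python
import math

nan = float("nan")

# A float is falsy only when it equals zero; NaN is not equal to zero.
print(nan == 0.0)   # False
print(bool(0.0))    # False
print(bool(nan))    # True

# NaN is not even equal to itself, so test for it explicitly.
print(nan == nan)       # False
print(math.isnan(nan))  # True
```

So a truth test is not a missing-value test; something like `math.isnan` has to be used for that.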

Andy Hayden

You have misjudged how the `or` and `and` operators work.

`or` checks whether the first value is truthy, i.e. whether bool(value) is True. If it is, `or` returns that first value; if it is False, it returns the second value.

`and`, on the other hand, checks whether both values are truthy: if bool(value1) is False it returns value1, otherwise it returns value2.
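
The rules above can be sketched as two small functions. The names `py_or` and `py_and` are hypothetical helpers written for illustration only; unlike the real operators they do not short-circuit, because both arguments are evaluated before the call.

```python
import math


def py_or(a, b):
    """Mimic `a or b`: return a if it is truthy, otherwise b."""
    return a if bool(a) else b


def py_and(a, b):
    """Mimic `a and b`: return a if it is falsy, otherwise b."""
    return b if bool(a) else a


nan = float("nan")

# NaN is truthy, so `or` returns it immediately and `and` moves on to the second operand.
print(py_or(nan, True))   # nan
print(py_or(True, nan))   # True
print(py_and(nan, True))  # True
print(py_and(True, nan))  # nan

# The helpers agree with the built-in operators on the cases from the question.
assert math.isnan(nan or True) and math.isnan(py_or(nan, True))
assert (nan and True) is True and py_and(nan, True) is True
```

This reproduces the order-dependent results from the question: the operators return one of their operands, not a freshly computed logical value.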

vishes_shell
  • So how come `np.nan or True` gives `nan`? If one of the arguments is True, then the logical or has to produce True regardless of the other argument. So in this case the result should be True, but it is not. And this proves that `np.nan` is not congruent with the 3-valued logic. – sztal May 11 '17 at 21:34
  • @sztal it's not. `np.nan` is considered "truthy" – juanpa.arrivillaga May 11 '17 at 21:35
  • It seems that in this case Python checks the first argument, sees that it is `np.nan` and declares (prematurely) that the result is undecidable, but it is very much decidable, since one of the arguments is True, so the logical disjunction must be True as well. – sztal May 11 '17 at 21:36
  • @juanpa.arrivillaga thanks for backup, wanted to put that in answer, but wanted to try it first, but lack of python interpreter on mobile phone:) – vishes_shell May 11 '17 at 21:38
  • @juanpa.arrivillaga, what does "truthy" exactly mean? It seems to me that this kind of behaviour may be quite dangerous in some situations. – sztal May 11 '17 at 21:38
  • @sztal truthy means that when you apply the `bool()` function to it, it returns `True` – vishes_shell May 11 '17 at 21:40
  • Truthy values are values that evaluate to `True` when you use `bool(some_value)`. Yes, it can be dangerous if you are unaware and careless. For example, though, it is a common idiom to check if a container is not empty doing something like `if my_list: do_something()` or `if my_dict: do_something()` because empty built-in containers are falsey. – juanpa.arrivillaga May 11 '17 at 21:40
  • @sztal: Per [the docs](https://docs.python.org/3/reference/expressions.html#boolean-operations): *"In the context of Boolean operations, and also when expressions are used by control flow statements, the following values are interpreted as false: `False`, `None`, numeric zero of all types, and empty strings and containers (including strings, tuples, lists, dictionaries, sets and frozensets). All other values are interpreted as true. User-defined objects can customize their truth value by providing a `__bool__()` method."* Since NaN is numeric, and not 0, it should be truthy. – ShadowRanger May 11 '17 at 21:42
  • And that link also mentions the behavior of `and` and `or`, they don't return `True` or `False` specifically, they just return the last value evaluated (where short-circuiting means some values may not be evaluated at all). – ShadowRanger May 11 '17 at 21:44