import numpy as np
a = np.array([0])
b = np.array([None])
c = np.array([''])
d = np.array([' '])

Why should we have this inconsistency:

>>> bool(a)
False
>>> bool(b)
False
>>> bool(c)
True
>>> bool(d)
False
Alex Riley
wim
  • 1
    Honestly, I think it would be better if 1D 1-element arrays didn't try to act like scalars here and just returned the same `ValueError` as any other array. But if they are going to do this, they should probably actually act like scalars and let Python use its normal rules (so `bool(self[0])`). But maybe there's some good reason for this… – abarnert May 05 '15 at 03:50
  • 1
    I was going to tell you to see this question which talks about a bug,.... but then I realized that you are the same person... Good luck! I would like to see a canonical source on this too. – Karl May 05 '15 at 03:51
  • I think they should behave like any other containers - return False for empty and True otherwise. – wim May 05 '15 at 03:54
  • That could be confusing as well—most all Python operators and functions magically act element-wise instead of container-wise on arrays; having some that were different (consider the `and` and `or` operators) would mean more to keep in your head. – abarnert May 05 '15 at 03:55
  • 1
    [Here](http://mail.scipy.org/pipermail/numpy-discussion/2014-November/071672.html) is a somewhat related mailing list discussion suggesting there may be some bugs, or at least surprising corner cases, lurking in the `__nonzero__` handling. – BrenBarn May 05 '15 at 04:10
  • @BrenBarn: I don't think that's related. It's about how NumPy deals with Python 2's `__nonzero__` vs. Python 3's `__bool__`. It supports both in both versions, but in a clunky way that was initially broken in Python 3, and now is correct in both, but its clunkiness can still be exposed by trying to use the wrong language's magic method. – abarnert May 05 '15 at 04:14

4 Answers


For arrays with one element, the array's truth value is determined by the truth value of that element.

The main point to make is that np.array(['']) is not an array containing one empty Python string. This array is created to hold strings of exactly one byte each and NumPy pads strings that are too short with the null character. This means that the array is equal to np.array(['\0']).

In this regard, NumPy is being consistent with Python which evaluates bool('\0') as True.
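
A minimal sketch of how this padding can be checked from Python. The variable names are only illustrative, and the dtype shown assumes Python 3, where the array holds fixed-width unicode rather than one-byte strings:

```python
import numpy as np

empty = np.array([''])      # looks like an empty string
null = np.array(['\0'])     # one explicit null character

# Both arrays get a one-character string dtype ('<U1' on Python 3).
same_dtype = empty.dtype == null.dtype

# The raw buffers are identical: the "empty" element is nothing but null padding.
same_bytes = empty.tobytes() == null.tobytes()

# Plain Python treats a one-character string holding '\0' as non-empty, hence truthy.
python_truth = bool('\0')

print(empty.dtype, same_dtype, same_bytes, python_truth)
```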

In fact, the only strings which are False in NumPy arrays are strings which do not contain any non-whitespace characters ('\0' is not a whitespace character).

Details of this Boolean evaluation are presented below.


Navigating NumPy's labyrinthine source code is not always easy, but we can find the code governing how values in different datatypes are mapped to Boolean values in the arraytypes.c.src file. This will explain how bool(a), bool(b), bool(c) and bool(d) are determined.

Before we get to the code in that file, we can see that calling bool() on a NumPy array invokes the internal _array_nonzero() function. If the array is empty, we get False. If there are two or more elements we get an error. But if the array has exactly one element, we hit the line:

return PyArray_DESCR(mp)->f->nonzero(PyArray_DATA(mp), mp);

Now, PyArray_DESCR is a struct holding various properties for the array. f is a pointer to another struct PyArray_ArrFuncs that holds the array's nonzero function. In other words, NumPy is going to call upon the array's own special nonzero function to check the Boolean value of that one element.
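
Seen from Python, the single-element and multi-element branches can be observed with a short sketch like the one below. The helper truth_or_error is a hypothetical name used only for this illustration. The empty-array case is left out on purpose, since its handling differs between NumPy versions:

```python
import numpy as np

def truth_or_error(arr):
    """Return bool(arr), or the exception name if NumPy refuses to decide."""
    try:
        return bool(arr)
    except ValueError as exc:
        return type(exc).__name__

one_zero = truth_or_error(np.array([0]))     # single element: its own truth value
one_nonzero = truth_or_error(np.array([7]))  # single element: its own truth value
many = truth_or_error(np.array([0, 1]))      # two or more elements: ambiguous

print(one_zero, one_nonzero, many)
```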

Determining whether an element is nonzero or not is obviously going to depend on the datatype of the element. The code implementing the type-specific nonzero functions can be found in the "nonzero" section of the arraytypes.c.src file.

As we'd expect, floats, integers and complex numbers are False if they're equal to zero. This explains bool(a). In the case of object arrays, None is similarly going to be evaluated as False because NumPy just calls the PyObject_IsTrue function. This explains bool(b).
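
A quick way to see both rules side by side. This is only a sketch and the variable names are illustrative. Note that with dtype=object the element is a genuine Python object, so Python's own truthiness applies, including for the empty string:

```python
import numpy as np

# Numeric dtypes: a single element is False exactly when it equals zero.
numeric_results = [
    bool(np.array([0])),
    bool(np.array([0.0])),
    bool(np.array([0j])),
]

# Object dtype: NumPy defers to Python's truth test for the stored object.
obj_none = bool(np.array([None]))                    # None is falsy
obj_empty_str = bool(np.array([''], dtype=object))   # real empty str, falsy
obj_text = bool(np.array(['x'], dtype=object))       # non-empty str, truthy

print(numeric_results, obj_none, obj_empty_str, obj_text)
```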

To understand the results of bool(c) and bool(d), we see that the nonzero function for string type arrays is mapped to the STRING_nonzero function:

static npy_bool
STRING_nonzero (char *ip, PyArrayObject *ap)
{
    int len = PyArray_DESCR(ap)->elsize; // size of dtype (not string length)
    int i;
    npy_bool nonz = NPY_FALSE;

    for (i = 0; i < len; i++) {
        if (!Py_STRING_ISSPACE(*ip)) {   // if it isn't whitespace, it's True
            nonz = NPY_TRUE;
            break;
        }
        ip++;
    }
    return nonz;
}

(The unicode case is more or less the same idea.)
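
To make the loop easier to follow, here is a rough pure-Python model of it. This is a sketch, not NumPy's code: string_nonzero_model and C_WHITESPACE are hypothetical names, and the whitespace set assumes the characters that C's isspace() accepts in the default locale.

```python
# Assumed whitespace set: space, \t, \n, \v, \f, \r (C isspace in the "C" locale).
C_WHITESPACE = b' \t\n\x0b\x0c\r'

def string_nonzero_model(element: bytes) -> bool:
    """Model of the loop above: True as soon as any byte is not whitespace."""
    for byte in element:
        if byte not in C_WHITESPACE:
            return True
    return False

print(string_nonzero_model(b' '))    # only whitespace
print(string_nonzero_model(b'\0'))   # the null padding counts as non-whitespace
print(string_nonzero_model(b'a'))
```

Under this model the padded "empty" element b'\0' comes out as True while b' ' comes out as False, which matches the results in the question.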

So in arrays with a string or unicode datatype, a string is only False if it contains only whitespace characters:

>>> bool(np.array([' ']))
False

In the case of array c in the question, there is really a null character \0 padding the seemingly-empty string:

>>> np.array(['']) == np.array(['\0'])
array([ True], dtype=bool)

The STRING_nonzero function sees this non-whitespace character and so bool(c) is True.

As noted at the start of this answer, this is consistent with Python's evaluation of strings containing a single null character: bool('\0') is also True.


Update: Wim has fixed the behaviour detailed above in NumPy's master branch by making strings which contain only null characters, or a mix of only whitespace and null characters, evaluate to False. This means that NumPy 1.10+ will see that bool(np.array([''])) is False, which is much more in line with Python's treatment of "empty" strings.
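
The updated rule, as described in this update, can be sketched in pure Python as follows. string_nonzero_fixed_model is a hypothetical name and the whitespace set is an assumption. The two checks at the end use real NumPy and should hold on 1.10 or later:

```python
import numpy as np

# Assumed set of bytes that no longer count towards truthiness:
# C whitespace characters plus the null padding byte.
WHITESPACE_OR_NULL = b' \t\n\x0b\x0c\r\x00'

def string_nonzero_fixed_model(element: bytes) -> bool:
    """True only if some byte is neither whitespace nor a null character."""
    return any(byte not in WHITESPACE_OR_NULL for byte in element)

empty_is_false = bool(np.array([''])) is False
text_is_true = bool(np.array(['a'])) is True

print(string_nonzero_fixed_model(b'\0'), string_nonzero_fixed_model(b'a\0'))
print(empty_is_false, text_is_true)
```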

Alex Riley
  • `np.array([' '])` is false.. hilarious! – wim Jun 14 '15 at 14:22
  • It certainly seems peculiar! Trying to find out if there's any reason for it, will update answer if I find anything... – Alex Riley Jun 14 '15 at 15:37
  • *"So in arrays with a string or unicode datatype, a string is only False if it is made of whitespace"* I don't understand how that follows. I see that code instead as saying *"a string is only True if it has a non-whitespace character"* – wim Jun 14 '15 at 16:59
  • I think `len` in this context is the value of [`elsize`](http://docs.scipy.org/doc/numpy/reference/c-api.types-and-structures.html#c.PyArray_Descr.elsize), which is the size of the datatype. Both `np.array([' '])` and `np.array([' '])` are created with a dtype of `' – Alex Riley Jun 14 '15 at 17:10
  • The second point you raise may be clumsy wording on my part - you've put it better. If there is at least one non-whitespace character, the string is `True`. If every character is a whitespace character then `nonz` is not changed to `True` and the function returns `False`. – Alex Riley Jun 14 '15 at 17:15
  • so what lives at *ip when the length of the string is actually 0? would it be a null terminator ? – wim Jun 14 '15 at 17:18
  • it seems `np.array(['\0']) == np.array([''])` evaluates to `array([ True], dtype=bool)` for me – wim Jun 14 '15 at 17:26
  • That's certainly the most logical explanation for what the seemingly empty string contains. On a somewhat related note, I guess if we know the width of a string in advance (like in arrays) we don't really need the null terminator so `np.array([' '])` doesn't contain one - it's just used as a placeholder character if empty strings are passed in. – Alex Riley Jun 14 '15 at 17:45
  • I don't really like the way numpy is handling strings here. If you set `x = ' \0'` then you get `np.array([x])[0] != x`. In python you have `len(x) == 2` but in `a = np.array([x])` suddenly `len(a[0]) == 1`. – wim Jun 14 '15 at 18:01
  • That feels awkward, I agree. Regarding why NumPy treats `''` as true, I speculate that it's because Python treats `bool('\0')` as true. There is no such thing as a truly empty string in an array, so NumPy is just being consistent with Python here. – Alex Riley Jun 14 '15 at 18:42
  • usually if `a == b` and `a == c` then `b == c`, but here we have `array(['']) == ''` and `array(['']) == '\0'`, with of course `'' != '\0'` – wim Jun 15 '15 at 00:03
  • Yes it seems when comparing string types, NumPy ignores null bytes at the end of the string (Python doesn't do this). I think this makes sense because it allows strings in different sized-dtype arrays to compare properly, e.g. `array(['a'], dtype='S2') == array(['a'], dtype='S5')`. I guess your example highlights the need to be wary when comparing Python strings and string arrays because the transitivity of `==` may fail... – Alex Riley Jun 15 '15 at 16:30
  • This seems like a good explanation of where in the source code to find the cause of the behavior, but it raises a lot of questions about why this behavior would ever make sense. `len(np.array([''])[0])` gives zero, so it's absurd to say that the string "isn't empty". The null byte in this "empty" string is not accessible in any way except for its awkward surfacing in this boolean context. It would make far more sense for the null terminating byte to be totally ignored for all computations. – BrenBarn Jun 16 '15 at 07:07
  • Yes, as your comment implies, NumPy's definition of an "empty" string is not the same as Python's. I suspect the rationale behind this quirk is to allow strings in a NumPy array of type, say `'S5'`, to have "length" less than 5. The reasonable way to do this is to have NumPy's string length function, `np.char.str_len`, ignore any trailing null characters. Python's built in `len` function doesn't do this for Python strings. Perhaps it would make more sense for the nonzero function for strings to be implemented in terms of `np.char.str_len` rather than whitespace. – Alex Riley Jun 16 '15 at 10:05
  • 1
    Thanks for your insights ajcr. It helped greatly in writing the fix -> https://github.com/numpy/numpy/pull/5967 – wim Jun 16 '15 at 11:15
  • I'm glad I could help. The fix looks good to me and certainly makes much more sense than just looking for whitespace. – Alex Riley Jun 16 '15 at 11:45

I'm pretty sure the answer is, as explained in Scalars, that:

Array scalars have the same attributes and methods as ndarrays. [1] This allows one to treat items of an array partly on the same footing as arrays, smoothing out rough edges that result when mixing scalar and array operations.

So, if it's acceptable to call bool on a scalar, it must be acceptable to call bool on an array of shape (1,), because they are, as far as possible, the same thing.

And, while it isn't directly said anywhere in the docs that I know of, it's pretty obvious from the design that NumPy's scalars are supposed to act like native Python objects.

So, that explains why np.array([0]) is falsey rather than truthy, which is what you were initially surprised about.
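
A small sketch of that equivalence for numeric values. The names are illustrative; the point is only that the Python value, the NumPy scalar, and the shape-(1,) array all agree on truthiness:

```python
import numpy as np

pairs = []
for value in (0, 1, 0.0, 2.5):
    scalar = np.array([value])[0]   # a NumPy scalar such as np.int64 or np.float64
    pairs.append((bool(value), bool(scalar), bool(np.array([value]))))

# Each triple holds the same truth value three times over.
agree = all(p[0] == p[1] == p[2] for p in pairs)
print(pairs, agree)
```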


So, that explains the basics. But what about the specifics of case c?

First, note that your array np.array(['']) is not an array of one Python object, but an array of one NumPy <U1 null-terminated character string of length 1. Fixed-length-string values don't have the same truthiness rule as Python strings—and they really couldn't; for a fixed-length-string type, "false if empty" doesn't make any sense, because they're never empty. You could argue about whether NumPy should have been designed that way or not, but it clearly does follow that rule consistently, and I don't think the opposite rule would be any less confusing here, just different.
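
To illustrate what "fixed-length" means in practice, here is a short sketch (variable names are illustrative). Every element gets the same amount of storage, shorter values are padded, and longer values assigned later are silently cut to fit:

```python
import numpy as np

words = np.array(['abc', 'de'])
width = words.dtype            # room for 3 characters per element
short_len = len(words[1])      # the padding is not reported by len()

words[0] = 'abcdef'            # too long for the dtype: truncated on assignment
truncated = str(words[0])

print(width, short_len, truncated)
```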

But there seems to be something else weird going on with strings. Consider this:

>>> np.array(['a', 'b']) != 0
True

That's not doing an elementwise comparison of the <U2 strings to 0 and returning array([True, True]) (as you'd get from np.array(['a', 'b'], dtype=object)), it's doing an array-wide comparison and deciding that no array of strings is equal to 0, which seems odd… I'm not sure whether this deserves a separate answer here or even a whole separate question, but I am pretty sure I'm not going to be the one who writes that answer, because I have no clue what's going on here. :)


Beyond arrays of shape (1,), arrays of shape () are treated the same way, but anything else is a ValueError, because otherwise it would be very easy to misuse arrays with `and` and other Python operators that NumPy can't automagically convert into elementwise operations.
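
A minimal sketch of these cases, with illustrative names. It also shows the usual alternatives when an array has more than one element: any() and all() for a single answer, or np.logical_and for an elementwise one.

```python
import numpy as np

zero_d_true = bool(np.array(3))    # shape () behaves like the scalar 3
zero_d_false = bool(np.array(0))   # shape () behaves like the scalar 0

flags = np.array([True, False, True])
try:
    if flags and True:             # `and` calls bool(flags), which is ambiguous
        pass
    raised = False
except ValueError:
    raised = True

any_set = bool(flags.any())        # is at least one element true?
all_set = bool(flags.all())        # are all elements true?
elementwise = np.logical_and(flags, np.array([True, True, False]))

print(zero_d_true, zero_d_false, raised, any_set, all_set, elementwise)
```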

I personally think being consistent with other arrays would be more useful than being consistent with scalars here—in other words, just raise a ValueError. I also think that, if being consistent with scalars were important here, it would be better to be consistent with the unboxed Python values. In other words, if bool(array([v])) and bool(array(v)) are going to be allowed at all, they should always return exactly the same thing as bool(v), even if that's not consistent with np.nonzero. But I can see the argument the other way.

abarnert
  • 2
    About the truthiness of empty NumPy-array strings, I must say that I don't think that it should be confusing that length-0 strings are empty: as a user, I take the NumPy string type as "strings of some _maximum_ length". The fact that they can have a fixed length in memory is only an implementation detail. – Eric O. Lebigot May 05 '15 at 04:15
  • @EOL: Well, they're only variable-length if you take them as null-terminated… which they are, but that doesn't fit in well with the Python string model, where `'\0'` is a perfectly valid non-empty length-1 string, not the same empty string as `''`. – abarnert May 05 '15 at 04:18
  • I would again say that being null terminated is an implementation detail. :) For example, to tell you the truth, I was imagining that each string contained an additional size field (which would arguably be a little wasteful for strings longer than 255 characters). :) So, I still argue that the semantics of an empty string is quite obvious (i.e., that it is… empty). This is exactly why @wim's question is relevant, I think. – Eric O. Lebigot May 05 '15 at 04:20
  • 1
    I have a feeling your explanation is right, but I don't quite see how the string scalars are different from the other scalars in this situation. How do you create a "fixed-length string scalar"? A related strange behavior is that `a != 0` and `b != 0` produce array results, but `c != 0` just produces the single value `True`. I think numpy is doing some kind of type-comparison trickery which causes it to use a different rule for comparing string arrays than for other types. – BrenBarn May 05 '15 at 04:22
  • @EOL: Well, I don't know how to go further with this; it's all down to whether you see NumPy arrays as in effect wrapping C-type "simple values" or as in effect wrapping Python objects boxed in those simple values. But at any rate, I don't think it's really relevant here; the fact is that NumPy's scalar types' bool rules _are_ different from Python's, whether you want them to be or not, and those are the rules that are applied. – abarnert May 05 '15 at 04:24
  • @BrenBarn: You construct a scalar from `dtype(' – abarnert May 05 '15 at 04:25
  • @abarnert Can you elaborate on these "NumPy rules"? I would be happy to understand the end of your post, which seems related: what is the "dtype scalar values" thing that you are mentioning, in the context of the original post? – Eric O. Lebigot May 05 '15 at 04:28
  • 1
    @BrenBarn: But I think you may be onto something with the `c != 0`. Maybe it's treating a string scalar as an array-like collection of its characters? – abarnert May 05 '15 at 04:28
  • @BrenBarn: Well, that doesn't seem to be it, but… `np.array(['a', 'a']) != 0` returns a single `True` value rather than `array([True, True])` as well, so you're _definitely_ on to something… I'm just not sure yet what it is. – abarnert May 05 '15 at 04:29
  • @EOL: `np.nonzero(x)` is (element-wise) true for numbers that aren't zero, and for everything that isn't a number. That's not quite the same as `bool(x)`, which is (non-element-wise) true for numbers that aren't zero, and for non-empty collections, and for everything that isn't a number or collection. – abarnert May 05 '15 at 05:03
  • @abarnert Thanks for the explanation, but `np.nonzero()` is actually element-wise _false_ for `None`, which contradicts what you are describing, as far as I understand. In effect, I still don't understand NumPy's logic from your description (the last comment of your answer): it still seems strange, as highlighted in @wim's question. – Eric O. Lebigot May 05 '15 at 07:20
  • @EOL: I don't understand what you're saying. The only way you can store `None` in a NumPy array is with dtype `object`; you can't put it in any of NumPy's own types. – abarnert May 05 '15 at 07:30
  • "arrays of shape `()` are always false" - are you sure? I just tried `bool(numpy.array(3))` and got `True`. – user2357112 May 05 '15 at 08:41
  • @user2357112: Sorry, you're right; they're treated as scalars too. I'll edit the answer. – abarnert May 05 '15 at 08:50
  • @abarnert I was trying to make sense of your "np.nonzero(x) is (element-wise) true for numbers that aren't zero, and for everything that isn't a number." What I don't understand is that `nonzero()` returns an empty array (no nonzero element) for `None`, `[None]` and `array([None])`, then: since `None` "isn't a number", I understand that you were saying that `nonzero()` is "true" for `None`, so it should indicate that it is non-zero, which is not the case (empty array returned by `nonzero()`). If you still don't see what I mean, we can stop here, no problem: the discussion is long already. :) – Eric O. Lebigot May 05 '15 at 11:40
  • @EOL: For `object` dtype, NumPy just defers to Python on whether something is truthy. For NumPy's own native types, it has its own rules. Do I need to edit that into the answer? – abarnert May 05 '15 at 18:49
  • 1
    @abarnert Ah, thanks! that's way clearer this way, for me. :) This wording could usefully replace the current one in your answer, I think. Now, how do we see what the (NumPy) truth value of NumPy strings is? extracting them from a NumPy array yields the expected (Python) behavior, and nothing weird, contrary to what array `c` in the original post seems to imply. – Eric O. Lebigot May 06 '15 at 02:25
  • @EOL: This wording doesn't seem to correspond to anything in the answer; it replaces the wording in a comment that explains a side issue that came up from another comment explaining another side issue… some of which may be important to _add_ to the question, but I'm not sure at this point… – abarnert May 06 '15 at 02:38
  • The casting is pretty weird, whether you use `== 0` or `== '0'` can also influence whether you get ndarray or bool output – wim May 06 '15 at 03:12
  • @wim: But only strings are weird. And I'm not sure exactly what's weird about them. – abarnert May 06 '15 at 03:15
  • I thought the pattern was that if there was implicit casting going on, you would get the boolean output, and for "same" types you get the ndarray output. So `a == 0` and `c == ''` both give you an ndarray. But `a == '0'` and `c == 0` both give you a bool. However, that idea went out the window because `b` behaves the exact opposite. Sighs – wim May 06 '15 at 04:14
  • @abarnert Thanks. It would be useful if you could expand on the end of your answer: I can't figure out what you mean by "a better way to do that would be to be consistent with … *rather than the dtype scalar values (the equivalent of np.nonzero(self[0]))*." – Eric O. Lebigot May 07 '15 at 01:59
  • @EOL: OK, is the new version of the last paragraph better? (It's really just my personal opinion, which may not be worth that much, anyway…) – abarnert May 07 '15 at 02:03
  • Thanks, I see what you mean, now. (I agree, too.) – Eric O. Lebigot May 07 '15 at 12:39

It's fixed in master now.

I thought this was a bug, and the numpy devs agreed, so this patch was merged earlier today. We should see new behaviour in the upcoming 1.10 release.

wim

NumPy seems to follow the same casting rules as built-in Python; in this context the result seems to depend on which values return true for calls to nonzero. Apparently len can also be used, but here none of these arrays are empty (length 0), so that's not directly relevant. Note that calling bool([False]) also returns True according to these rules.

import numpy as np

a = np.array([0])
b = np.array([None])
c = np.array([''])

>>> np.nonzero(a)
(array([], dtype=int64),)
>>> np.nonzero(b)
(array([], dtype=int64),)
>>> np.nonzero(c)
(array([0]),)

This also seems consistent with the more enumerative description of bool casting --- where your examples are all explicitly discussed.
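
For reference, a short sketch of what np.nonzero itself returns, since it is easy to mix up with a truth test. It yields the indices of the elements counted as nonzero, whereas bool() on a one-element array yields a single True or False. The names are illustrative:

```python
import numpy as np

values = np.array([0, 3, 0, 5])
indices = np.nonzero(values)      # tuple with one index array per dimension
positions = indices[0].tolist()   # positions of the nonzero entries

single_truth = bool(np.array([3]))  # one truth value, not a set of indices

print(positions, single_truth)
```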

Interestingly, there does seem to be systematically different behavior with string arrays, e.g.

>>> a.astype(bool)
array([False], dtype=bool)
>>> b.astype(bool)
array([False], dtype=bool)
>>> c.astype(bool)
ERROR: ValueError: invalid literal for int() with base 10: ''

I think, when numpy converts something into a bool it uses the PyArray_BoolConverter function which, in turn, just calls the PyObject_IsTrue function --- i.e. the exact same function that builtin python uses, which is why numpy's results are so consistent.

DilithiumMatrix
  • I suspect wim knows this, and is asking why NumPy is using its own `nonzero` instead of Python's `bool` when the only point of letting (1,)-shape arrays respond to `bool` is to let them transparently act like scalars. – abarnert May 05 '15 at 03:53
  • 1
    Never mind, from your link, you don't even understand that this is about NumPy. – abarnert May 05 '15 at 03:54
  • `bool([False])` may evaluate to `True`, but so do `bool([0])`, `bool([None])`, and `bool([''])`. The question is, why does `numpy` treat the empty string differently from other falsey values in this context? – TigerhawkT3 May 05 '15 at 04:00
  • @zhermes: None of the examples are explicitly discussed there, since all of the examples in the question are about **numpy arrays**. – BrenBarn May 05 '15 at 04:01
  • 2
    zhermes, you are very close to making a correct point (even though it does not explain _why_ NumPy is choosing the convention that @wim observed): the documentation that you quote (along with the definition of `bool()`) indicates that `__nonzero__()` is used in this case (a little bit ambiguously, since `__len__()` could also be used). You are using `numpy.nonzero()` where you should be using `__nonzero__()`. – Eric O. Lebigot May 05 '15 at 04:04
  • 3
    Python doesn't even _have_ a function called `nonzero`. There is a `__nonzero__` magic method (in 2.x only…), but the function that calls it is `bool`. What you've done is imported a somewhat-related but not actually-related-to-this-problem NumPy function whose name is confusing you. – abarnert May 05 '15 at 04:04
  • In the edited version, your first sentence is not true, and in fact that's why wim noticed this in the first place. NumPy does _not_ follow the same rules for its scalar types as Python does for its native types. And your link is now even more misleading, because it's still mixing up Python's `__nonzero__` and NumPy's `nonzero`, and it looks like a link that's relevant to the latter but is actually relevant to the former. – abarnert May 05 '15 at 04:26