I'm being driven crazy by a NumPy array of dtype object with a missing value (in the example below, it is the penultimate value).

>>> a
array([0, 3, 'Braund, Mr. Owen Harris', 'male', 22.0, 1, 0, 'A/5 21171',
       7.25, nan, 'S'], dtype=object)

I want to find this missing value programmatically with a function that returns a boolean vector that is True at the positions of missing values in the array (as in the example below).

>>> some_function(a)
array([False, False, False, False, False, False, False, False, False, True, False],
      dtype=bool)

I tried isnan to no avail.

>>> isnan(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not
be safely coerced to any supported types according to the casting rule ''safe''

I also attempted performing the operation explicitly over every element of the array with apply_along_axis, but the same error is returned.

>>> apply_along_axis(isnan, 0, a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not
be safely coerced to any supported types according to the casting rule ''safe''

Can anyone explain to me (1) what I'm doing wrong and (2) what I can do to solve this problem? From the error, I gather that it has to do with one of the elements not being in an appropriate type. What is the easiest way to get around this issue?

Gyan Veda
  • I don't think you can 'nan' an object – RickyA Sep 09 '14 at 20:35
  • You mean `isnan` an object? – Gyan Veda Sep 09 '14 at 20:39
  • If the `nan`s you're looking for are confined to that column, you could slice or index the array before applying `isnan`. You might also consider a [structured array](http://docs.scipy.org/doc/numpy/user/basics.rec.html) rather than an object array. – user2357112 Sep 09 '14 at 20:53

3 Answers

Another workaround is:

In [148]: [item != item for item in a]
Out[148]: [False, False, False, False, False, False, False, False, False, True, False]

since NaNs are not equal to themselves. Note, however, that it is possible to define custom objects which, like NaN, are not equal to themselves:

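class Foo(object):
    # Python 3 ignores __cmp__, so define the rich comparisons directly.
    def __eq__(self, other):
        return False
    def __ne__(self, other):
        return True
foo = Foo()
assert foo != foo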

so using item != item does not necessarily mean item is a NaN.
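To get the boolean vector asked for in the question rather than a plain list, the comparison can be wrapped in an array. This is only a minimal sketch: it assumes NumPy is imported as `np`, and the helper name `not_equal_to_self` is made up for illustration.

```python
import numpy as np

def not_equal_to_self(arr):
    """Return a boolean mask that is True where an element != itself."""
    # Each element of an object array is a plain Python object, so the
    # comparison runs in Python; NaN is the usual value that differs from itself.
    return np.array([item != item for item in arr], dtype=bool)

# The array from the question.
a = np.array([0, 3, 'Braund, Mr. Owen Harris', 'male', 22.0, 1, 0,
              'A/5 21171', 7.25, np.nan, 'S'], dtype=object)
mask = not_equal_to_self(a)
print(mask)
```

The caveat above still applies: any object that compares unequal to itself would also be flagged.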


Note that it is generally a good idea to avoid NumPy arrays of dtype object if possible.

  • They are not particularly quick -- operations on their contents generally devolve into Python calls on the underlying Python objects. A normal Python list often has better performance.
  • Unlike numeric arrays, which can be more space efficient than a Python list of numbers, object arrays are not particularly space efficient since every item is a reference to a Python object.
  • They are also not particularly convenient since many NumPy operations do not work on arrays of dtype object. isnan is one such example.
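The last point can be seen by running isnan on the same values stored with two different dtypes. The snippet below is a small sketch, assuming NumPy is imported as `np`; the variable names are illustrative only.

```python
import numpy as np

numeric = np.array([22.0, 7.25, np.nan])   # dtype float64
float_mask = np.isnan(numeric)             # works on a numeric dtype
print(float_mask)

as_objects = numeric.astype(object)        # same values, dtype object
try:
    np.isnan(as_objects)
    raised = False
except TypeError:
    # This is the same TypeError shown in the question.
    raised = True
print(raised)

# Converting the purely numeric part back to float makes isnan usable again.
recovered_mask = np.isnan(as_objects.astype(float))
print(recovered_mask)
```

Converting back to float only works when every element is numeric, so for the mixed array in the question the numeric fields would have to be separated out first.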
unutbu
  • Thanks for your answer! Just curious, if you don't recommend NumPy arrays of dtype `object`, what do you suggest to store mixed data (i.e., numeric and string)? – Gyan Veda Sep 09 '14 at 21:01
  • I would use a Python list or tuple instead. – unutbu Sep 09 '14 at 21:06

I figured it out! List comprehension is the way to go.

The problem arises from the fact that isnan cannot be called on strings. Therefore, the trick is to iterate through the elements, performing the isnan operation on any elements that are NOT of the type string.

[isnan(i) if type(i) != str else False for i in a]
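Note that the comprehension above only excludes strings, so an element such as None would still reach isnan and raise the same TypeError. A slightly more defensive variant is sketched below. It assumes missing values are stored as Python floats (which includes `np.nan` and `np.float64` values), and the helper name `is_nan_mask` is hypothetical.

```python
import math
import numpy as np

def is_nan_mask(arr):
    """True where an element is a float NaN; False for everything else."""
    # Only floats are handed to math.isnan, so strings, ints and None
    # are reported as "not missing" instead of raising an error.
    return np.array(
        [isinstance(x, float) and math.isnan(x) for x in arr], dtype=bool
    )

mixed = np.array([0, 'male', None, 22.0, np.nan, float('nan')], dtype=object)
mask = is_nan_mask(mixed)
print(mask)
```

Whether None should count as missing depends on the data; here it is deliberately treated as not-NaN.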
Gyan Veda

I suggest using pandas.isna. Unlike the corresponding function in NumPy, this version handles missing string values.

import numpy as np
import pandas as pd
s = np.array(['one', 'two', None, 'four'])
pd.isna(s)

The output:

array([False, False,  True, False])
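Applied to the mixed object array from the question, the same call should return the requested boolean vector. This is a sketch, assuming pandas is installed and imported as `pd`:

```python
import numpy as np
import pandas as pd

# The array from the question.
a = np.array([0, 3, 'Braund, Mr. Owen Harris', 'male', 22.0, 1, 0,
              'A/5 21171', 7.25, np.nan, 'S'], dtype=object)

# pd.isna accepts object arrays and returns a boolean ndarray.
mask = pd.isna(a)
print(mask)
```

Note that pd.isna flags None as well as NaN, which may or may not be what is wanted.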
Boris Gorelik