I have a list of values. Some of the values are corrupt and represented as 'nan'. After manipulating some of the data, the 'nan' propagates, as expected and intended.

In the manipulated dataset, I want to find the number of useless values. Intuitively, I used list.count(nan), but to my surprise, and without any warning, only the 'unmanipulated' nans are counted.

I found no immediate answer in the Python documentation for math.nan, and the documentation of the list.count(x) method is not very precise:

Return the number of times x appears in the list.

from math import nan, isnan

list1 = [nan]

myitem1 = list1[0]
myitem2 = list1[0] + 1 # common operation: extract a value from a list and modify it
print(myitem2) # nan: looks like nan
print(isnan(myitem2)) # True: is nan

list2 = [1, nan, myitem1, myitem2]
count1 = list2.count(nan)
count2 = sum(isnan(e) for e in list2)
print(count1, count2)  # 2 3: count() misses the nan produced by arithmetic

1 Answer

While myitem2 has a NaN value, it is a different float object than math.nan: in CPython, adding to a float creates a new float object, even if the result has the same value as the operand.

list2.count(nan) counts only the items that are the math.nan object itself, not every item whose value is NaN.

It helps to look at the object IDs (example values, they are different each time you run the code):

>>> id(nan)
140305278866152
>>> id(myitem2)
140305278866176
>>> [id(x) for x in list2]
[9079008, 140305278866152, 140305278866152, 140305278866176]
#   1           nan              nan            myitem2

Now you might ask,

>>> a = 5.0
>>> b = a + 0
>>> list3 = [a, b]
>>> id(a)
140133123191696
>>> id(b)
140133123191504
>>> [id(x) for x in list3]
[140133123191696, 140133123191504]
>>> list3.count(a)
2

why does this not return 1, as a and b are two different objects with the same value?

The explanation is that count first compares each list item to its argument by identity and, only if they are different objects, then compares them by value.
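The identity-then-value rule can be sketched in pure Python. This is only a rough model of the behavior, not the actual implementation, and the helper name count_like_list is made up for illustration:

```python
from math import nan

def count_like_list(items, value):
    """Rough pure-Python model of list.count (hypothetical helper)."""
    # An item counts if it is the same object as value,
    # or if it compares equal to value.
    return sum(1 for item in items if item is value or item == value)

a = 5.0
b = a + 0            # equal value; a distinct object in CPython
derived = nan + 1    # NaN value, but not the math.nan object

print(count_like_list([a, b], a))                    # 2
print(count_like_list([1, nan, nan, derived], nan))  # 2
```

Both results agree with what the built-in count returns for the same lists.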

I didn't find where this is specified, but here is the CPython implementation (comments added by me):

static PyObject *
list_count(PyListObject *self, PyObject *value)
{
    Py_ssize_t count = 0;
    Py_ssize_t i;

    for (i = 0; i < Py_SIZE(self); i++) {
        PyObject *obj = self->ob_item[i];
        // comparison by identity
        if (obj == value) {
           count++;
           continue;
        }
        Py_INCREF(obj);
        // comparison by value
        int cmp = PyObject_RichCompareBool(obj, value, Py_EQ);
        Py_DECREF(obj);
        if (cmp > 0)
            count++;
        else if (cmp < 0)
            return NULL;
    }
    return PyLong_FromSsize_t(count);
}

Finally, you need to know that comparing anything to NaN by value always returns "not equal", even if the other value is also NaN. I also didn't find where this is specified for Python, but see "Why does comparing to nan yield False (Python)?" and "What is the rationale for all comparisons returning false for IEEE754 NaN values?".
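A short check of these comparisons, as a minimal sketch (the name other is only illustrative):

```python
from math import nan, isnan

other = nan + 1        # a second NaN, produced by arithmetic

print(nan == nan)      # False: NaN never compares equal, even to itself
print(other == nan)    # False
print(nan != nan)      # True
print(nan is nan)      # True: identity is unaffected, it is the same object
print(isnan(other))    # True: isnan tests the value, not the identity
```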

That's why myitem2 is not included in .count(nan), but b is included in .count(a).
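To count NaN values regardless of how they were produced, test each item's value with isnan, as the question already does. The sketch below adds a type check so that non-float entries do not raise a TypeError; the helper name count_nans and the sample list are assumptions for illustration, and it only handles Python floats (not, say, decimal.Decimal NaNs):

```python
from math import isnan, nan

def count_nans(items):
    """Count float NaNs in items, skipping non-float entries (hypothetical helper)."""
    return sum(1 for item in items if isinstance(item, float) and isnan(item))

list2 = [1, nan, nan, nan + 1, "n/a", None]
print(count_nans(list2))   # 3
print(list2.count(nan))    # 2: misses the NaN created by nan + 1
```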
