16

While nan == nan is always False, in many cases people want to treat them as equal, and this is enshrined in pandas.DataFrame.equals:

NaNs in the same location are considered equal.

Of course, I can write

def equalp(x, y):
    return (x == y) or (math.isnan(x) and math.isnan(y))

However, this will fail on containers like [float("nan")] and isnan barfs on non-numbers (so the complexity increases).
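
To make the two failure modes concrete, here is a minimal sketch that runs `equalp` on a scalar, on a container, and on a non-number. The variable names (`scalar_ok`, `container_error`, `string_error`) are only for illustration.

```python
import math

def equalp(x, y):
    return (x == y) or (math.isnan(x) and math.isnan(y))

# Scalars behave as intended: two NaNs are reported as equal.
scalar_ok = equalp(float("nan"), float("nan"))

# Containers: the two lists hold distinct NaN objects, so x == y is False,
# and math.isnan then rejects the list argument with a TypeError.
try:
    equalp([float("nan")], [float("nan")])
    container_error = None
except TypeError as exc:
    container_error = exc

# Non-numbers: two different strings also reach math.isnan and raise.
try:
    equalp("a", "b")
    string_error = None
except TypeError as exc:
    string_error = exc

print(scalar_ok)
print(type(container_error).__name__, type(string_error).__name__)
```

Note that a list compared against another list holding the very same NaN object would compare equal, because list comparison checks identity before equality; the failure shows up when the NaNs are distinct objects.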

So, what do people do to compare complex Python objects which may contain nan?

PS. Motivation: when comparing two rows in a pandas DataFrame, I would convert them into dicts and compare dicts element-wise.

PPS. When I say "compare", I am thinking diff, not equalp.

juanpa.arrivillaga
sds
  • If you're asking what people do... then the answer is, they usually don't. Having non-scalar/object columns is usually considered bad form, and introduces a lot of headaches you could otherwise avoid by flattening your data a bit. It's also a less-performant option. – cs95 Jan 25 '18 at 22:31
  • @cᴏʟᴅsᴘᴇᴇᴅ I think they mean when outside of pandas containers, like lists with `float('nan')` in them. – juanpa.arrivillaga Jan 25 '18 at 22:31
  • I think *most* people just accept that Python knows best and `NaN != NaN`. Or try to avoid having NaN altogether. – Mark Ransom Jan 25 '18 at 22:34
  • Hmm, in that case, are your lists always integers or floats? – cs95 Jan 25 '18 at 22:37
  • Yeah, at this point, you might as well use something like `NAN = object()` then replace `float('nan')` with `NAN` – juanpa.arrivillaga Jan 25 '18 at 22:37
  • @sds why would you do this? "Motivation: when comparing two rows in a pandas DataFrame, I would convert them into dicts and compare dicts element-wise." – juanpa.arrivillaga Jan 25 '18 at 23:02
  • @sds: Like juanpa said, do you really need the dict (maybe for other operations)? There is also `df.as_matrix()` which would make things easier. – ascripter Jan 25 '18 at 23:14
  • @juanpa.arrivillaga: how would you compare two rows of length 400? – sds Jan 26 '18 at 02:32
  • @sds `df.iloc[1,:].equals(df.iloc[2,:])`? – juanpa.arrivillaga Jan 26 '18 at 02:38
  • @juanpa.arrivillaga: okay, I got `False`. How do I get the list of columns where the rows are different? – sds Jan 26 '18 at 02:40

3 Answers

11

Suppose you have a data-frame with nan values:

In [10]: df = pd.DataFrame(np.random.randint(0, 20, (10, 10)).astype(float), columns=["c%d"%d for d in range(10)])

In [10]: df.where(np.random.randint(0,2, df.shape).astype(bool), np.nan, inplace=True)

In [10]: df
Out[10]:
     c0    c1    c2    c3    c4    c5    c6    c7   c8    c9
0   NaN   6.0  14.0   NaN   5.0   NaN   2.0  12.0  3.0   7.0
1   NaN   6.0   5.0  17.0   NaN   NaN  13.0   NaN  NaN   NaN
2   NaN  17.0   NaN   8.0   6.0   NaN   NaN  13.0  NaN   NaN
3   3.0   NaN   NaN  15.0   NaN   8.0   3.0   NaN  3.0   NaN
4   7.0   8.0   7.0   NaN   9.0  19.0   NaN   0.0  NaN  11.0
5   NaN   NaN  14.0   2.0   NaN   NaN   0.0   NaN  NaN   8.0
6   3.0  13.0   NaN   NaN   NaN   NaN   NaN  12.0  3.0   NaN
7  13.0  14.0   NaN   5.0  13.0   NaN  18.0   6.0  NaN   5.0
8   3.0   9.0  14.0  19.0  11.0   NaN   NaN   NaN  NaN   5.0
9   3.0  17.0   NaN   NaN   0.0   NaN  11.0   NaN  NaN   0.0

And you want to compare rows, say, row 0 and 8. Then just use fillna and do vectorized comparison:

In [12]: df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)
Out[12]:
c0     True
c1     True
c2    False
c3     True
c4     True
c5    False
c6     True
c7     True
c8     True
c9     True
dtype: bool

You can use the resulting boolean array to index into the columns, if you just want to know which columns are different:

In [14]: df.columns[df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)]
Out[14]: Index(['c0', 'c1', 'c3', 'c4', 'c6', 'c7', 'c8', 'c9'], dtype='object')
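
Note that filling with 0 makes a NaN and a genuine 0 compare as equal, so the approach above only works when the fill value cannot occur in the data. A minimal sketch that avoids a fill value, combining `==` with `isnull()`; the small frame and the names `row_a`, `row_b`, and `same` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; c4 holds 0.0 in one row and NaN in the other,
# which fillna(0) would wrongly report as equal.
df = pd.DataFrame(
    {
        "c0": [np.nan, 3.0],
        "c1": [6.0, 9.0],
        "c2": [14.0, 14.0],
        "c3": [np.nan, np.nan],
        "c4": [0.0, np.nan],
    }
)

row_a = df.iloc[0, :]
row_b = df.iloc[1, :]

# Equal if the values match, or if both entries are NaN.
same = (row_a == row_b) | (row_a.isnull() & row_b.isnull())

# Columns where the two rows differ.
differing_columns = list(df.columns[(~same).to_numpy()])
print(differing_columns)
```

Here `c3` (NaN in both rows) counts as equal, while `c4` (0.0 against NaN) is reported as different.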
juanpa.arrivillaga
  • This gives you False for entries that were NaN in one column and 0 in the other – Diego F Medina Jul 17 '23 at 16:22
  • @DiegoFMedina yes, totally. I was being too clever for my own good. This will only work if you know a good value to use for a fillvalue (due to the nature of your data). Alternatively, you can do something like: `row0 = df.iloc[0, :]; row8 = df.iloc[8,:];` then `(row0 == row8) | (row0.isnull() & row8.isnull())` to find the columns that are equal treating NaNs as equal – juanpa.arrivillaga Jul 17 '23 at 16:44
4

I assume you have array-data or can at least convert to a numpy array?

One way is to mask all the NaNs using a numpy.ma array and then compare the arrays. Your starting situation would be something like this:

import numpy as np
import numpy.ma as ma
arr1 = ma.array([3,4,6,np.nan,2])
arr2 = ma.array([3,4,6,np.nan,2])

print(arr1 == arr2)
print(ma.all(arr1 == arr2))

>>> [ True  True  True False  True]
>>> False  # <-- you want this to show True

Solution:

arr1[np.isnan(arr1)] = ma.masked
arr2[np.isnan(arr2)] = ma.masked

print(arr1 == arr2)
print(ma.all(arr1 == arr2))

>>> [True True True -- True]
>>> True
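
As a variation, here is a sketch using plain NumPy arrays. `ma.masked_invalid` masks the NaNs in one call, and the helper `nan_equal_mask` (a hypothetical name, not a NumPy function) does an element-wise comparison that treats two NaNs as equal. One thing worth verifying on your NumPy version: as far as I can tell, a position that is NaN in only one array stays masked in the masked comparison and is then ignored by `ma.all`, whereas the element-wise helper reports it as different.

```python
import numpy as np
import numpy.ma as ma

a = np.array([3, 4, 6, np.nan, 2])
b = np.array([3, 4, 6, np.nan, 2])
c = np.array([3, 4, 6, 5, 2])  # a number where a has NaN

# Masked comparison: NaN positions are masked out before comparing.
all_equal_masked = bool(ma.all(ma.masked_invalid(a) == ma.masked_invalid(b)))

def nan_equal_mask(x, y):
    """Element-wise equality that treats NaN == NaN as True (hypothetical helper)."""
    return (x == y) | (np.isnan(x) & np.isnan(y))

same_ab = nan_equal_mask(a, b)
same_ac = nan_equal_mask(a, c)

print(all_equal_masked)
print(same_ab.all(), same_ac.all())
```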
ascripter
0

Here's a function that recurses into a data structure replacing nan values with a unique string. I wrote this for a unit test that compares data structures that may contain nan.

It's only designed for data structures made of dict and list, but it's easy to see how to expand it.

from math import isnan
from uuid import uuid4
from typing import Union

NAN_REPLACEMENT = f"THIS_WAS_A_NAN{uuid4()}"

def replace_nans(data_structure: Union[dict, list]) -> Union[dict, list]:
    if isinstance(data_structure, dict):
        iterme = data_structure.items()
    elif isinstance(data_structure, list):
        iterme = enumerate(data_structure)
    else:
        raise ValueError(
            "replace_nans should only be called on structures made of dicts and lists"
        )

    for key, value in iterme:
        if isinstance(value, float) and isnan(value):
            data_structure[key] = NAN_REPLACEMENT
        elif isinstance(value, (dict, list)):
            data_structure[key] = replace_nans(data_structure[key])
    return data_structure
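
Note that `replace_nans` modifies its argument in place. If that is undesirable, or if you want a diff rather than a yes/no answer, a non-mutating variant is possible. The following is a minimal sketch, not part of the function above: `nan_diff` is a hypothetical helper that walks two structures made of dicts and lists and yields the paths at which they differ, treating two NaN floats as equal. It assumes leaf values support `!=` returning a plain bool (so no NumPy arrays as leaves).

```python
import math

def nan_diff(a, b, path=()):
    """Yield paths (tuples of keys/indices) where a and b differ; NaN equals NaN."""
    if isinstance(a, dict) and isinstance(b, dict):
        for key in a.keys() | b.keys():
            if key not in a or key not in b:
                # Key present on one side only.
                yield path + (key,)
            else:
                yield from nan_diff(a[key], b[key], path + (key,))
    elif isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            # Different lengths: report the list itself as differing.
            yield path
        else:
            for index, (x, y) in enumerate(zip(a, b)):
                yield from nan_diff(x, y, path + (index,))
    elif (isinstance(a, float) and isinstance(b, float)
          and math.isnan(a) and math.isnan(b)):
        # Both NaN: treated as equal, nothing to report.
        return
    elif a != b:
        yield path

left = {"x": [1.0, float("nan")], "y": float("nan"), "z": 3}
right = {"x": [1.0, float("nan")], "y": 2.0, "z": 3}
differences = list(nan_diff(left, right))
print(differences)
```

An empty result means the two structures are equal under this NaN-tolerant comparison.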
Shaun Taylor