Pandas DataFrames with NaNs equality comparison

Question

In the context of unit testing some functions, I'm trying to establish the equality of 2 DataFrames using python pandas:

ipdb> expect
                            1   2
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df
identifier                  1   2
timestamp
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df[1][0]
nan

ipdb> df[1][0], expect[1][0]
(nan, nan)

ipdb> df[1][0] == expect[1][0]
False

ipdb> df[1][1] == expect[1][1]
True

ipdb> type(df[1][0])
<type 'numpy.float64'>

ipdb> type(expect[1][0])
<type 'numpy.float64'>

ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])

ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False

Given that I'm trying to test the entire of expect against the entire of df, including NaN positions, what am I doing wrong?

What is the simplest way to compare equality of Series/DataFrames including NaNs?

score 61 · Accepted Answer · edited Jun 09 '19 at 00:29

61

You can use assert_frame_equals with check_names=False (so as not to check the index/columns names), which will raise if they are not equal:

In [11]: from pandas.testing import assert_frame_equal

In [12]: assert_frame_equal(df, expected, check_names=False)

You can wrap this in a function with something like:

try:
    assert_frame_equal(df, expected, check_names=False)
    return True
except AssertionError:
    return False

In more recent pandas this functionality has been added as .equals:

df.equals(expected)

edited Jun 09 '19 at 00:29

cs95

379,657
97
704
746

answered Oct 11 '13 at 16:17

Andy Hayden

359,921
101
625
535

I accepted this one rather than the others (which are useful!) since the real crux of my question was about comparing DataFrames - I've edited it a little to make this more clear. – Steve Pike Oct 11 '13 at 16:28
FYI, this is the slowest method as its done recursively and not vectorized (we use it for testing) – Jeff Oct 11 '13 at 16:29
1

@StevePike Not sure I understand this: "...since the real crux of my question was about comparing DataFrames". All solutions present at the moment are showing ways to compare `DataFrames`. *Validating* them is a slightly different question. Maybe I'm being a bit pedantic. – Phillip Cloud Oct 11 '13 at 16:37
1

This gives a `FutureWarning` when the objects are not equal: `print assert_frame_equal(X, df)` gives output `None` and warning `/usr/local/lib/python2.7/site-packages/numpy/core/numeric.py:2367: FutureWarning: numpy equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (`is`)) and will change. return bool(asarray(a1 == a2).all())` – FooBar Mar 12 '15 at 15:12
@FooBar Interesting, I expect this will be fixed in pandas 0.16. (out soon) – Andy Hayden Mar 14 '15 at 00:59
@FooBar which version of numpy/pandas are you using? I just updated to latest numpy (1.9.2) and do not see this message. Pandas 0.15.2. – Andy Hayden Mar 14 '15 at 08:39
It appears that `check_names=False` ignores neither column nor index differences. Even with it set I get `AssertionError: DataFrame.index are different` and when I set the same index I still get `AssertionError: DataFrame.columns are different`. Only if I set the same index and columns then then everything is ok. Not sure what `check_names` is supposed to control. – kilgoretrout Oct 30 '15 at 22:27
@river_jones `check_names` compares the `df.index.names` and `df.columns.names` attributes – Andy Hayden Oct 30 '15 at 22:49

score 48 · Answer 2 · edited May 23 '17 at 11:47

48

One of the properties of NaN is that NaN != NaN is True.

Check out this answer for a nice way to do this using numexpr.

(a == b) | ((a != a) & (b != b))

says this (in pseudocode):

a == b or (isnan(a) and isnan(b))

So, either a equals b, or both a and b are NaN.

If you have small frames then assert_frame_equal will be okay. However, for large frames (10M rows) assert_frame_equal is pretty much useless. I had to interrupt it, it was taking so long.

In [1]: df = DataFrame(rand(1e7, 15))

In [2]: df = df[df > 0.5]

In [3]: df2 = df.copy()

In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)

In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop

timeit of the (presumably) desired single bool indicating whether the two DataFrames are equal:

In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop

edited May 23 '17 at 11:47

Community

1
1

answered Oct 11 '13 at 16:16

Phillip Cloud

24,919
11
68
88

presumably after you need to .all().all() ? – Andy Hayden Oct 11 '13 at 16:36
Much faster to use `np.all()` since this comparison guarantees a boolean result. It adds another ~100 ms to the running time. If you use the `all()` methods of pandas objects there's about a 3.5x perf hit. – Phillip Cloud Oct 11 '13 at 16:40
"Guarantees" meaning I'm assuming that someone is not using a subclass that could return `NaN` from `__eq__`/`__ne__`. – Phillip Cloud Oct 11 '13 at 16:41
2

you can do ``((df == df2) | ((df != df) & (df2 != df2))).values.ravel().all()`` – Jeff Oct 11 '13 at 16:43
I'll add that to the `timeit`. – Phillip Cloud Oct 11 '13 at 16:44
BTW, no need to `ravel` if no `axis` is passed. – Phillip Cloud Oct 11 '13 at 16:44
no because best and worst case are equal, all values are checked in either case – Phillip Cloud Oct 11 '13 at 16:46

score 11 · Answer 3 · answered Oct 11 '13 at 16:19

Like @PhillipCloud answer, but more written out

In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])

In [27]: df2 = df1.copy()

They really are equivalent

In [28]: result = df1 == df2

In [29]: result[pd.isnull(df1) == pd.isnull(df2)] = True

In [30]: result
Out[30]: 
      0     1
0  True  True
1  True  True

A nan in df2 that doesn't exist in df1

In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])

In [32]: result = df1 == df2

In [33]: result[pd.isnull(df1) == pd.isnull(df2)] = True

In [34]: result
Out[34]: 
       0     1
0   True  True
1  False  True

You can also fill with a value you know not to be in the frame

In [38]: df1.fillna(-999) == df1.fillna(-999)
Out[38]: 
      0     1
0  True  True
1  True  True

If you have two non-equal values I think this will give True since they are both nonnull :s — Andy Hayden, Oct 11 '13 at 16:38
@Andy Hayden is correct: `result[pd.isnull(df1) == pd.isnull(df2)] = True` should be `result[pd.isnull(df1) & pd.isnull(df2)] = True` — Carl, Dec 04 '20 at 00:12

Lydia · Answer 4 · 2015-11-23T19:22:54.043

6

Any equality comparison using == with np.NaN is False, even np.NaN == np.NaN is False.

Simply, df1.fillna('NULL') == df2.fillna('NULL'), if 'NULL' is not a value in the original data.

To be safe, do the following:

Example a) Compare two dataframes with NaN values

bools = (df1 == df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = True
assert bools.all().all()

Example b) Filter rows in df1 that do not match with df2

bools = (df1 != df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = False
df_outlier = df1[bools.all(axis=1)]

(Note: this is wrong - bools[pd.isnull(df1) == pd.isnull(df2)] = False)

edited Nov 23 '15 at 19:22

answered Aug 20 '15 at 16:32

Lydia

2,377
16
13

2

Your answer would look better if you add a bit more text to your answer, maybe explaining what you are doing. – KurzedMetal Aug 20 '15 at 19:04
This strategy is faster than the accepted answer for small cases, but much slower for large ones – Andrew May 19 '20 at 22:45

score 4 · Answer 5 · answered Oct 01 '15 at 18:08

4

df.fillna(0) == df2.fillna(0)

You can use fillna(). Documenation here.

from pandas import DataFrame

# create a dataframe with NaNs
df = DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df2 = df

# comparison fails!
print df == df2

# all is well 
print df.fillna(0) == df2.fillna(0)

answered Oct 01 '15 at 18:08

stephentgrammer

470
6
16

0 feels like a dangerous thing to fill with in general, but filling with some unique string or identifier sounds like a good strategy. This strategy is faster than the accepted answer for small cases, but much slower for large ones – Andrew May 19 '20 at 22:45

score 1 · Answer 6 · answered Jun 29 '23 at 09:15

1

Ever since pandas 1.1.0 there is df.compare(expect), which provides detailed results if any differences occur.

answered Jun 29 '23 at 09:15

Michel de Ruiter

7,131
5
49
74

score 0 · Answer 7 · answered Jul 21 '23 at 18:05

0

Michel de Ruiter's answer should be the current answer. Posting an answer because I don't have enough reputation to add a comment to it. Use the dataframe compare method.

df.compare(expect) will return an empty dataframe (all axis are 0-length) if df and expect are equal, including NaN locations. You can use the empty dataframe property on the result.

assert(df.compare(expect).empty))

answered Jul 21 '23 at 18:05

Steven Brown

41
4

This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/34727116) – doneforaiur Jul 25 '23 at 09:39

Pandas DataFrames with NaNs equality comparison

7 Answers7

Linked

Related