73

In the context of unit testing some functions, I'm trying to establish the equality of 2 DataFrames using python pandas:

ipdb> expect
                            1   2
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df
identifier                  1   2
timestamp
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df[1][0]
nan

ipdb> df[1][0], expect[1][0]
(nan, nan)

ipdb> df[1][0] == expect[1][0]
False

ipdb> df[1][1] == expect[1][1]
True

ipdb> type(df[1][0])
<type 'numpy.float64'>

ipdb> type(expect[1][0])
<type 'numpy.float64'>

ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])

ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False

Given that I'm trying to test the entire of expect against the entire of df, including NaN positions, what am I doing wrong?

What is the simplest way to compare equality of Series/DataFrames including NaNs?

Steve Pike
  • 1,454
  • 2
  • 12
  • 14

7 Answers7

61

You can use assert_frame_equals with check_names=False (so as not to check the index/columns names), which will raise if they are not equal:

In [11]: from pandas.testing import assert_frame_equal

In [12]: assert_frame_equal(df, expected, check_names=False)

You can wrap this in a function with something like:

try:
    assert_frame_equal(df, expected, check_names=False)
    return True
except AssertionError:
    return False

In more recent pandas this functionality has been added as .equals:

df.equals(expected)
cs95
  • 379,657
  • 97
  • 704
  • 746
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • I accepted this one rather than the others (which are useful!) since the real crux of my question was about comparing DataFrames - I've edited it a little to make this more clear. – Steve Pike Oct 11 '13 at 16:28
  • FYI, this is the slowest method as its done recursively and not vectorized (we use it for testing) – Jeff Oct 11 '13 at 16:29
  • 1
    @StevePike Not sure I understand this: "...since the real crux of my question was about comparing DataFrames". All solutions present at the moment are showing ways to compare `DataFrames`. *Validating* them is a slightly different question. Maybe I'm being a bit pedantic. – Phillip Cloud Oct 11 '13 at 16:37
  • 1
    This gives a `FutureWarning` when the objects are not equal: `print assert_frame_equal(X, df)` gives output `None` and warning `/usr/local/lib/python2.7/site-packages/numpy/core/numeric.py:2367: FutureWarning: numpy equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (`is`)) and will change. return bool(asarray(a1 == a2).all())` – FooBar Mar 12 '15 at 15:12
  • @FooBar Interesting, I expect this will be fixed in pandas 0.16. (out soon) – Andy Hayden Mar 14 '15 at 00:59
  • @FooBar which version of numpy/pandas are you using? I just updated to latest numpy (1.9.2) and do not see this message. Pandas 0.15.2. – Andy Hayden Mar 14 '15 at 08:39
  • It appears that `check_names=False` ignores neither column nor index differences. Even with it set I get `AssertionError: DataFrame.index are different` and when I set the same index I still get `AssertionError: DataFrame.columns are different`. Only if I set the same index and columns then then everything is ok. Not sure what `check_names` is supposed to control. – kilgoretrout Oct 30 '15 at 22:27
  • @river_jones `check_names` compares the `df.index.names` and `df.columns.names` attributes – Andy Hayden Oct 30 '15 at 22:49
48

One of the properties of NaN is that NaN != NaN is True.

Check out this answer for a nice way to do this using numexpr.

(a == b) | ((a != a) & (b != b))

says this (in pseudocode):

a == b or (isnan(a) and isnan(b))

So, either a equals b, or both a and b are NaN.

If you have small frames then assert_frame_equal will be okay. However, for large frames (10M rows) assert_frame_equal is pretty much useless. I had to interrupt it, it was taking so long.

In [1]: df = DataFrame(rand(1e7, 15))

In [2]: df = df[df > 0.5]

In [3]: df2 = df.copy()

In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)

In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop

timeit of the (presumably) desired single bool indicating whether the two DataFrames are equal:

In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop
Community
  • 1
  • 1
Phillip Cloud
  • 24,919
  • 11
  • 68
  • 88
11

Like @PhillipCloud answer, but more written out

In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])

In [27]: df2 = df1.copy()

They really are equivalent

In [28]: result = df1 == df2

In [29]: result[pd.isnull(df1) == pd.isnull(df2)] = True

In [30]: result
Out[30]: 
      0     1
0  True  True
1  True  True

A nan in df2 that doesn't exist in df1

In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])

In [32]: result = df1 == df2

In [33]: result[pd.isnull(df1) == pd.isnull(df2)] = True

In [34]: result
Out[34]: 
       0     1
0   True  True
1  False  True

You can also fill with a value you know not to be in the frame

In [38]: df1.fillna(-999) == df1.fillna(-999)
Out[38]: 
      0     1
0  True  True
1  True  True
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • 3
    If you have two non-equal values I think this will give True since they are both nonnull :s – Andy Hayden Oct 11 '13 at 16:38
  • 2
    @Andy Hayden is correct: `result[pd.isnull(df1) == pd.isnull(df2)] = True` should be `result[pd.isnull(df1) & pd.isnull(df2)] = True` – Carl Dec 04 '20 at 00:12
6

Any equality comparison using == with np.NaN is False, even np.NaN == np.NaN is False.

Simply, df1.fillna('NULL') == df2.fillna('NULL'), if 'NULL' is not a value in the original data.

To be safe, do the following:

Example a) Compare two dataframes with NaN values

bools = (df1 == df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = True
assert bools.all().all()

Example b) Filter rows in df1 that do not match with df2

bools = (df1 != df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = False
df_outlier = df1[bools.all(axis=1)]

(Note: this is wrong - bools[pd.isnull(df1) == pd.isnull(df2)] = False)

Lydia
  • 2,377
  • 16
  • 13
  • 2
    Your answer would look better if you add a bit more text to your answer, maybe explaining what you are doing. – KurzedMetal Aug 20 '15 at 19:04
  • This strategy is faster than the accepted answer for small cases, but much slower for large ones – Andrew May 19 '20 at 22:45
4
df.fillna(0) == df2.fillna(0)

You can use fillna(). Documenation here.

from pandas import DataFrame

# create a dataframe with NaNs
df = DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df2 = df

# comparison fails!
print df == df2

# all is well 
print df.fillna(0) == df2.fillna(0)
stephentgrammer
  • 470
  • 6
  • 16
  • 0 feels like a dangerous thing to fill with in general, but filling with some unique string or identifier sounds like a good strategy. This strategy is faster than the accepted answer for small cases, but much slower for large ones – Andrew May 19 '20 at 22:45
1

Ever since pandas 1.1.0 there is df.compare(expect), which provides detailed results if any differences occur.

Michel de Ruiter
  • 7,131
  • 5
  • 49
  • 74
0

Michel de Ruiter's answer should be the current answer. Posting an answer because I don't have enough reputation to add a comment to it. Use the dataframe compare method.

df.compare(expect) will return an empty dataframe (all axis are 0-length) if df and expect are equal, including NaN locations. You can use the empty dataframe property on the result.

assert(df.compare(expect).empty))
  • This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/34727116) – doneforaiur Jul 25 '23 at 09:39