6

I tried all the solutions here: Pandas "Can only compare identically-labeled DataFrame objects" error

None of them worked for me. I have two data frames: one holds financial data that already exists in the system, and the other holds data that may or may not already exist there. I need to find the difference and add the rows that don't exist yet.

Here is the code:

import pandas as pd
import numpy as np
from azure.storage.blob import AppendBlobService, PublicAccess, ContentSettings
from io import StringIO

dataUrl = "http://ichart.finance.yahoo.com/table.csv?s=MSFT"
blobUrlBase = "https://pyjobs.blob.core.windows.net/"
data = pd.read_csv(dataUrl)

abs = AppendBlobService(account_name='pyjobs', account_key='***')
abs.create_container("stocks", public_access = PublicAccess.Container)
abs.append_blob_from_text('stocks', 'msft', data[:25].to_csv(index=False))
existing = pd.read_csv(StringIO(abs.get_blob_to_text('stocks', 'msft').content))

ne = (data != existing).any(1)

The failing code is the final line. I was working through an article on determining differences between data frames.

I checked the dtypes on all columns, and they appear to be the same. I also did a side-by-side output, sorted the axes, sorted the indices, dropped the indices, and so on. I still get that bloody error.

Here is the output of the first row of existing and data

>>> existing[:1]
         Date       Open   High    Low  Close    Volume  Adj Close
0  2016-05-27  51.919998  52.32  51.77  52.32  17653700      52.32
>>> data[:1]
         Date       Open   High    Low  Close    Volume  Adj Close
0  2016-05-27  51.919998  52.32  51.77  52.32  17653700      52.32

Here is the exact error I receive:

>>> ne = (data != existing).any(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\site-packages\pandas\core\ops.py", line 1169, in f
    return self._compare_frame(other, func, str_rep)
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3571, in _compare_frame
    raise ValueError('Can only compare identically-labeled '
ValueError: Can only compare identically-labeled DataFrame objects
David Crook

4 Answers

14

To get around this, compare the underlying NumPy arrays. That ignores the index and column labels and compares the values by position.

import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=['One', 'Two'])
df2 = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'], index=['one', 'two'])


df1.values == df2.values

array([[ True,  True],
       [ True,  True]], dtype=bool)
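As a rough sketch of why the labels matter (the frames `full` and `head` below are made up for illustration, not the asker's data): the `!=` operator refuses to compare frames whose indexes differ, while the raw arrays can be compared by position as long as both slices have the same shape. Trimming to the shorter length is my own assumption about how to handle frames of different size, which the comments below mention.

```python
import pandas as pd

# Hypothetical frames: same columns, but `head` has fewer rows than `full`.
full = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
head = full[:2].copy()

# The operator form requires identical index and column labels.
try:
    full != head
    raised = False
except ValueError:
    raised = True

# Positional comparison of the overlapping rows only (assumed workaround).
n = min(len(full), len(head))
diff = full.values[:n] != head.values[:n]   # 2-D boolean array
rows_differ = diff.any(axis=1)              # one flag per compared row
```

This only tells you which of the first `n` rows differ by position; it says nothing about the rows beyond `n`.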
piRSquared
  • This does solve the initial error, so I marked it as the answer for that. However, it doesn't do an element-by-element comparison. I was anticipating a matrix of boolean values (or a similar data structure of bools), but I am getting back a single boolean value, true or false. – David Crook Jun 01 '16 at 11:53
  • It is important to note that my data frames are of different size. – David Crook Jun 01 '16 at 12:09
  • What exactly is being gotten around here? I'm getting the same error with the same column and index names, even taking capitalization into account. – Joseph Garvin Oct 24 '18 at 20:02
  • @DavidCrook, have you been able to find an answer? I have the same situation: two data frames of the same size with identical indices, identical columns, and identical dtypes, but I get this error when trying to use compare() and equals(). I'm lost as to what could be happening. – Tatiana Apr 21 '21 at 06:54
  • @Tatiana using .values[0] on one of the values in comparison resolved my issue. – Bikash Behera Jul 24 '21 at 14:57
3

If you want to compare two DataFrames, check out flexible comparison in pandas, using methods like .eq(), .ne(), .gt() and more (equal, not equal, and greater than).

Example:

result = df.gt(df_1)  # boolean DataFrame, one True/False per cell

http://pandas.pydata.org/pandas-docs/stable/basics.html#flexible-comparisons
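A small sketch of how the flexible methods behave when the labels do not match, using made-up frames `df` and `df_1`: rather than raising, `.ne()` aligns both frames on the union of their labels, so a row that exists in only one frame is compared against missing values and shows up as "not equal".

```python
import pandas as pd

# Hypothetical frames: df_1 has one extra row and one changed value.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df_1 = pd.DataFrame({"A": [1, 9, 5], "B": [3, 4, 6]})

# .ne() aligns on labels first, so no ValueError is raised here.
cell_differs = df.ne(df_1)

# Collapse to one flag per row label.
row_differs = cell_differs.any(axis=1)
```

Here rows 1 and 2 are flagged: row 1 because `A` changed, row 2 because it is absent from `df`.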

Melroy van den Berg
2

I replicated the problem with some fake data to reach my actual goal, which was removing duplicates. Note that this does not answer the original question; it solves the underlying task that led me to ask it.

import numpy as np
import pandas as pd

b = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                  'B': ['B4', 'B5', 'B6', 'B7'],
                  'C': ['C4', 'C5', 'C6', 'C7'],
                  'D': ['D4', 'D5', 'D6', 'D7']},
                 index=[4, 5, 6, 7])

c = pd.DataFrame({'A': ['A7', 'A8', 'A9', 'A10', 'A11'],
                  'B': ['B7', 'B8', 'B9', 'B10', 'B11'],
                  'C': ['C7', 'C8', 'C9', 'C10', 'C11'],
                  'D': ['D7', 'D8', 'D9', 'D10', 'D11']},
                 index=[7, 8, 9, 10, 11])

# Stack both frames, then keep the first occurrence of each value in column "A".
result = pd.concat([b, c])
idx = np.unique(result["A"], return_index=True)[1]
# DataFrame.sort() no longer exists in pandas; sort_index() restores index order.
result.iloc[idx].sort_index()
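For the goal stated in the question (finding the rows that are not yet in the system), another possible approach is a left merge with `indicator=True`. This is only a sketch: the two small frames are invented, and treating `Date` as the key that identifies a row is my assumption, not something the question confirms.

```python
import pandas as pd

# Hypothetical stand-ins for the stored data and the freshly downloaded data.
existing = pd.DataFrame({"Date": ["2016-05-27", "2016-05-26"],
                         "Close": [52.32, 51.89]})
data = pd.DataFrame({"Date": ["2016-05-27", "2016-05-26", "2016-05-25"],
                     "Close": [52.32, 51.89, 52.12]})

# Left merge on the assumed key; "_merge" records where each row was found.
merged = data.merge(existing[["Date"]], on="Date", how="left", indicator=True)

# Rows marked "left_only" exist in `data` but not in `existing`.
missing = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
```

Matching on a key column rather than on every column avoids depending on floating-point values surviving a CSV round trip unchanged.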
David Crook
0

I faced the same issue and resolved it by sorting the index along both axes before comparing the two dataframes.

df1 = df1.sort_index(axis=1)
df2 = df2.sort_index(axis=1)
df1 = df1.sort_index()
df2 = df2.sort_index()
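To illustrate when this helps, here is a minimal sketch with two invented frames that hold the same labels in a different order. The comparison fails before sorting and succeeds afterwards. It will not help if the two frames have different sets of labels, as in the question, where the frames differ in length.

```python
import pandas as pd

# Hypothetical frames: identical data, but rows and columns in a different order.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])
df2 = pd.DataFrame({"B": [4, 3], "A": [2, 1]}, index=["y", "x"])

# Label order counts as a mismatch for the comparison operators.
try:
    df1 != df2
    raised = False
except ValueError:
    raised = True

# Sort columns, then rows, so both frames share the same label order.
df1 = df1.sort_index(axis=1).sort_index()
df2 = df2.sort_index(axis=1).sort_index()

ne = (df1 != df2).any(axis=1)   # per-row flag: does any cell differ?
```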