I'm having some difficulty using pandas.

I have two DataFrames (named bru and bru2), both loaded from almost the same file. The only difference between the two files is that I added an extra row and changed one cell value from "4" to "50000" for testing.

What I'd now like to do is find the changed cells and the new rows.

But first of all, I'm checking if both dataframes are the same so that I don't have to look for changes when both files have the exact same data.

When I try to compare them (bru == bru2), I get an error: Can only compare identically-labeled DataFrame objects.

I import the files like this. I also drop some columns that I don't need, reorder the columns of both files into exactly the same order, and rename some of them by preference:

import pandas as pd

bru = pd.read_csv("file1.csv", dtype={"street_id": "string",  "address_id": "string"})
bru = bru.fillna('')
bru = bru.drop(columns=["EPSG:31370_x", "EPSG:31370_y", "EPSG:4326_lat", "EPSG:4326_lon", "postname_fr", "postname_nl", "streetname_de"])
bru = bru.rename(columns={"postcode": "pkancode"})
bru = bru.reindex(columns=["address_id", "box_number", "house_number", "municipality_id", "municipality_name_de", "municipality_name_fr", "municipality_name_nl", "pkancode", "street_id", "streetname_nl", "streetname_fr", "region_code", "status"])
    

bru2 = pd.read_csv("file2.csv", dtype={"street_id": "string",  "address_id": "string"})
bru2 = bru2.fillna('')
bru2 = bru2.drop(columns=["EPSG:31370_x", "EPSG:31370_y", "EPSG:4326_lat", "EPSG:4326_lon", "postname_fr", "postname_nl", "streetname_de"])
bru2 = bru2.rename(columns={"postcode": "pkancode"})
bru2 = bru2.reindex(columns=["address_id", "box_number", "house_number", "municipality_id", "municipality_name_de", "municipality_name_fr", "municipality_name_nl", "pkancode", "street_id", "streetname_nl", "streetname_fr", "region_code", "status"])

What am I doing wrong?

I've tried the solutions from these other Stack Overflow questions, but for some reason they did not work for me:

Error: Can only compare identically-labeled DataFrame objects

Pandas "Can only compare identically-labeled DataFrame objects" error

Yorbjörn
  • The labels that have to match are both the column headers and the row index. Try using [`pd.DataFrame.compare`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html#pandas-dataframe-compare). – Scott Boston Dec 01 '21 at 13:48
  • @ScottBoston sadly I still get the same error when I try `bru.compare(bru2)` – Yorbjörn Dec 01 '21 at 13:55
  • If you only want to compare the rows of bru2 that also appear in bru, use reindex_like: `bru2.reindex_like(bru).compare(bru)` compares only the rows and columns that are in bru. – Scott Boston Dec 01 '21 at 13:56
  • You can also use bru.index.difference(bru2.index) to find differences in the rows, and do the same with the column headers if needed. – Scott Boston Dec 01 '21 at 13:58

1 Answer

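You can use reindex_like to give bru2 the same row and column labels as bru, and then compare the DataFrames. Rows that exist only in bru2 are dropped by the reindexing, so this reports changed cells but not new rows.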

bru2.reindex_like(bru).compare(bru)
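
As a rough illustration, here is a minimal sketch on made-up data (the two small frames below are stand-ins for the real CSV files, not the actual data). It shows that `==` raises as soon as the row labels differ, while the `reindex_like` plus `compare` combination returns only the cells that changed. `DataFrame.compare` needs pandas 1.1 or later.

```python
import pandas as pd

# Toy stand-ins for the two files (assumed data, not the real CSVs).
bru = pd.DataFrame({
    "address_id": ["a1", "a2", "a3"],
    "house_number": ["4", "7", "9"],
})
# Same data, but one cell changed ("4" -> "50000") and one row appended.
bru2 = pd.DataFrame({
    "address_id": ["a1", "a2", "a3", "a4"],
    "house_number": ["50000", "7", "9", "12"],
})

# The row labels differ (0..2 versus 0..3), so == refuses to compare.
try:
    bru == bru2
    raised = False
except ValueError:
    raised = True

# Conform bru2 to bru's labels (this drops the extra row), then compare.
# "self" holds the bru2 value and "other" holds the bru value.
changed = bru2.reindex_like(bru).compare(bru)
print(changed)
```

Only row 0 of `house_number` should show up in `changed`; identical rows and columns are left out by default.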

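You can also use pd.Index.difference to find the row or column labels that are in one DataFrame but not in the other. Swap the operands to get the labels that are only in bru2, such as an added row.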

bru.index.difference(bru2.index)  # row labels in bru that are missing from bru2; likewise with bru.columns and bru2.columns
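
With the default positional index, a row inserted in the middle of the file shifts every later row, so label differences become hard to interpret. One possible approach, sketched below on made-up data, is to index both frames by a key column first. This assumes address_id is unique in each file, which the question does not confirm. The names `old`, `new`, `added`, `removed` and `common` are only illustrative.

```python
import pandas as pd

# Toy stand-ins for the two files (assumed data), indexed by address_id.
# Assumption: address_id uniquely identifies a row in each file.
old = pd.DataFrame({
    "address_id": ["a1", "a2", "a3"],
    "house_number": ["4", "7", "9"],
}).set_index("address_id")
new = pd.DataFrame({
    "address_id": ["a1", "a4", "a2", "a3"],  # "a4" inserted in the middle
    "house_number": ["50000", "12", "7", "9"],
}).set_index("address_id")

# Quick check first: equals() returns False instead of raising
# when the shapes or labels differ.
identical = old.equals(new)

# Keys that exist on only one side.
added = new.index.difference(old.index)
removed = old.index.difference(new.index)

# Compare cell by cell on the keys that both frames share.
common = old.index.intersection(new.index)
changed = new.loc[common].compare(old.loc[common])

print(identical, list(added), list(removed))
print(changed)
```

If `identical` is True, the rest can be skipped. Otherwise `added` lists the new rows and `changed` lists the modified cells for rows present in both files.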
Scott Boston