
I am trying to compare two CSV files using pandas dataframes. One is a master sheet that will have data appended to it daily (test_master.csv). The second is a daily report (test_daily.csv) that contains the data I want to append to test_master.csv.

I am creating two pandas dataframes from these files:

import pandas as pd

dfmaster = pd.read_csv('test_master.csv')
dfdaily = pd.read_csv('test_daily.csv')

I want the daily list compared to the master list to see if any rows on the daily list are already in the master list. If so, I want to remove those duplicates from dfdaily, and then write the remaining non-duplicate data to dfmaster.

The duplicate data will always be an entire row. My original plan was to iterate through the sheets row by row to make the comparison.

I realize I could append my daily data to the dfmaster dataframe and use drop_duplicates to remove the duplicates. I cannot figure out how to remove the duplicates in the dfdaily dataframe, though. And I need to be able to write the dfdaily data back to test_daily.csv (or another new file) without the duplicate data.
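For reference, one way to do exactly this is a left merge with indicator=True, which tags each daily row according to whether it also exists in the master. A minimal sketch, with small inline frames standing in for the two CSV files (column names follow the example tables below):

```python
import pandas as pd

# Inline frames standing in for pd.read_csv('test_master.csv')
# and pd.read_csv('test_daily.csv').
dfmaster = pd.DataFrame({'column 1': [1, 4, 7, 'dup'],
                         'column 2': [2, 5, 8, 'dup'],
                         'column 3': [3, 6, 9, 'dup']})
dfdaily = pd.DataFrame({'column 1': ['dup', 10],
                        'column 2': ['dup', 11],
                        'column 3': ['dup', 12]})

# Left-merge dfdaily against dfmaster on all shared columns; indicator=True
# adds a "_merge" column, and rows found only in dfdaily are tagged 'left_only'.
merged = dfdaily.merge(dfmaster, how='left', indicator=True)
dfdaily_new = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Append only the genuinely new rows to the master.
dfmaster = pd.concat([dfmaster, dfdaily_new], ignore_index=True)

# In the real workflow, both frames would then be written back out:
# dfmaster.to_csv('test_master.csv', index=False)
# dfdaily_new.to_csv('test_daily.csv', index=False)
```

Because the merge compares values rather than index labels, it is unaffected by the index-label mismatch discussed in the comments below.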

Here is an example of what the dataframes could look like.

test_master.csv

+-------------+-------------+-------------+
|  column 1   |  column 2   |  column 3   |
+-------------+-------------+-------------+
| 1           | 2           | 3           |
| 4           | 5           | 6           |
| 7           | 8           | 9           |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
+-------------+-------------+-------------+

test_daily.csv

+-------------+-------------+-------------+
|  column 1   |  column 2   |  column 3   |
+-------------+-------------+-------------+
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10          | 11          | 12          |
| 13          | 14          | 15          |
+-------------+-------------+-------------+

Desired output is:

test_master.csv

+-------------+-------------+-------------+
|  column 1   |  column 2   |  column 3   |
+-------------+-------------+-------------+
| 1           | 2           | 3           |
| 4           | 5           | 6           |
| 7           | 8           | 9           |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10          | 11          | 12          |
| 13          | 14          | 15          |
+-------------+-------------+-------------+

test_daily.csv

+----------+----------+----------+
| column 1 | column 2 | column 3 |
+----------+----------+----------+
|       10 |       11 |       12 |
|       13 |       14 |       15 |
+----------+----------+----------+

Any help would be greatly appreciated!

EDIT

I incorrectly thought solutions from the set difference question solved my problem. I ran into certain cases where those solutions did not work. I believe it had something to do with index labels, as mentioned in the comment by Troy D below. Troy D's solution is the one I am now using.

Philalethes
  • The accepted answer only works if the dataframes have identical index labels for the rows, which doesn't seem the case for this question, so I don't think this is a duplicate of: https://stackoverflow.com/questions/18180763/set-difference-for-pandas – Troy D Apr 30 '18 at 17:31
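Troy D's point can be seen in a small sketch (the frames here are illustrative, not from the question): `DataFrame.isin` with another DataFrame aligns on index labels, so identical rows stored under different labels go undetected, while a value-based check finds them.

```python
import pandas as pd

# Two frames sharing the value 2, but under different index labels.
df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [2, 3]}, index=[5, 6])

# DataFrame.isin(DataFrame) matches on both index and column labels, so
# nothing matches here even though the value 2 appears in both frames.
aligned = df2.isin(df1)

# Series.isin with a sequence of values ignores the index and finds it.
value_based = df2['A'].isin(df1['A'])
```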

1 Answer


Try this:

First I create two example dataframes, then set rows 2-4 of the daily frame to be duplicates of the master's rows:

import numpy as np
import pandas as pd

test_master = pd.DataFrame(np.random.rand(3, 3), columns=['A', 'B', 'C'])
test_daily = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
test_daily.iloc[1:4] = test_master[:3].values

print(test_master)
print(test_daily)

output:

      A         B         C
0  0.009322  0.330057  0.082956
1  0.197500  0.010593  0.356774
2  0.147410  0.697779  0.421207
      A         B         C
0  0.643062  0.335643  0.215443
1  0.009322  0.330057  0.082956
2  0.197500  0.010593  0.356774
3  0.147410  0.697779  0.421207
4  0.973867  0.873358  0.502973

Then, add a multiindex level to identify which data is from which dataframe:

test_master['master'] = 'master'
test_master.set_index('master', append=True, inplace=True)
test_daily['daily'] = 'daily'
test_daily.set_index('daily', append=True, inplace=True)

Now merge as you suggested and drop duplicates:

merged = pd.concat([test_master, test_daily])  # DataFrame.append was removed in pandas 2.0
merged = merged.drop_duplicates().sort_index()
print(merged)

output:

             A         B         C
  master                              
0 daily   0.643062  0.335643  0.215443
  master  0.009322  0.330057  0.082956
1 master  0.197500  0.010593  0.356774
2 master  0.147410  0.697779  0.421207
4 daily   0.973867  0.873358  0.502973

There you see the combined dataframe with the origin of the data labeled in the index. Now just slice for the daily data:

idx = pd.IndexSlice
print(merged.loc[idx[:, 'daily'], :])

output:

             A         B         C
  master                              
0 daily   0.643062  0.335643  0.215443
4 daily   0.973867  0.873358  0.502973
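Putting the whole answer together, here is a reproducible end-to-end version (seeded, so the random values are fixed; I use a single level name `source` for both frames, which is my naming choice, so the tag survives the concat). The final step drops the helper index level so the daily-only rows can be written back to CSV:

```python
import numpy as np
import pandas as pd

# Reproducible version of the example above (seeded; the values differ
# from the answer's random output).
np.random.seed(0)
test_master = pd.DataFrame(np.random.rand(3, 3), columns=['A', 'B', 'C'])
test_daily = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
test_daily.iloc[1:4] = test_master[:3].values

# Tag each frame's rows via an extra index level.
test_master['source'] = 'master'
test_master.set_index('source', append=True, inplace=True)
test_daily['source'] = 'daily'
test_daily.set_index('source', append=True, inplace=True)

# Combine and drop value-duplicates; the master copy of each duplicated
# row is kept because it appears first.
merged = pd.concat([test_master, test_daily]).drop_duplicates().sort_index()

# Slice out the surviving daily rows and drop the helper index level
# before writing them back out.
idx = pd.IndexSlice
daily_only = merged.loc[idx[:, 'daily'], :].reset_index(level=1, drop=True)
# daily_only.to_csv('test_daily.csv', index=False)
```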
Troy D
  • Thank you Troy D. I was starting to run into some cases where answers from Set Difference question linked to above failed. Your solution seems to produce the exact result I was looking for. – Philalethes May 02 '18 at 15:16
  • Troy D. - Would you know why the above only works if I manually open and resave my "daily" csv file that I build the "daily data frame" from? I run a report and have it saved to a csv file automatically each night. If I just run my program on this "daily" csv file, the above code does not remove the duplicates. When I view the data frames, they appear to be identical. However, if I simply manually open, save, and close the "daily" csv file, the program then removes the duplicates. I am at a loss on how to remedy this. I have tried many workarounds without success. – Philalethes May 03 '18 at 16:16
  • Are you opening it in Excel? Excel will add single quotation marks around the values of a csv if you open it and then save it. It looks the same in Excel, but it's been changed. If you look at the file using a text editor instead, you should be able to see the difference that way. Then the CSV reader will see it differently depending on how it's formatted to read the data. That'd be my first guess. – Troy D May 03 '18 at 16:37
  • That is definitely it. In the text editor I see the original csv file has " " around each value. After I save it in excel and reopen it in a text editor, the " " are no longer there. Any suggestions on an easy work around? – Philalethes May 03 '18 at 16:48
  • If you're using pd.read_csv(), the "sep=" option allows you to specify the style of separators in the file. I'm not sure exactly what would work for your file, but maybe check out https://superuser.com/questions/1184776/read-csv-using-pandas-with-values-enclosed-with-double-quotes-and-values-have-co/1257803 – Troy D May 03 '18 at 19:01
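On the quoting point in these comments: `pd.read_csv` strips standard double quotes automatically and still infers numeric dtypes, so a fully quoted file and an unquoted one usually parse identically; if the nightly report uses a non-standard quoting style, the `quotechar=` and `quoting=` parameters can be tuned to match. A small sketch, with `io.StringIO` standing in for the real nightly file:

```python
import io
import pandas as pd

# A daily file in which every field is wrapped in double quotes,
# as described in the comments above.
raw = io.StringIO('"column 1","column 2","column 3"\n"10","11","12"\n')

# read_csv strips the standard double quotes and still infers numeric
# dtypes, so "10" parses as the integer 10.
df = pd.read_csv(raw)
```

If the two files still parse to different dtypes (e.g. strings in one, integers in the other), normalizing with `astype` before comparing is one way to make the duplicate check reliable.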