I am trying to compare two csv files using pandas dataframes. One is a master sheet that is going to have data appended to it daily (test_master.csv). The second is a daily report (test_daily.csv) that contains the data I want to append to the test_master.csv.
I am creating two pandas dataframes from these files:
import pandas as pd
dfmaster = pd.read_csv(test_master.csv)
dfdaily = pd.read_csv(test_daily.csv)
I want the daily list to get compared to the master list to see if there are any duplicate rows on the daily list that are already in the master list. If so, I want them to remove the duplicates from dfdaily. I then want to write this non-duplicate data to dfmaster.
The duplicate data will always be an entire row. My plan was to iterate through the sheets row by row to make the comparison, then.
I realize I could append my daily data to the dfmaster dataframe and use drop_duplicates to remove the duplicates. I cannot figure out how to remove the duplicates in the dfdaily dataframe, though. And I need to be able to write the dfdaily data back to test_daily.csv (or another new file) without the duplicate data.
Here is an example of what the dataframes could look like.
test_master.csv
column 1 | column 2 | column 3 |
+-------------+-------------+-------------+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
| 7 | 8 | 9 |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2
test_daily.csv
+-------------+-------------+-------------+
| column 1 | column 2 | column 3 |
+-------------+-------------+-------------+
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+-------------+-------------+-------------+
Desired output is:
test_master.csv
+-------------+-------------+-------------+
| column 1 | column 2 | column 3 |
+-------------+-------------+-------------+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
| 7 | 8 | 9 |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+-------------+-------------+-------------+
test_daily.csv
+----------+----------+----------+
| column 1 | column 2 | column 3 |
+----------+----------+----------+
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+----------+----------+----------+
Any help would be greatly appreciated!
EDIT
I incorrectly thought solutions from the set difference question solved my problem. I ran into certain cases where those solutions did not work. I believe it had something to do with index numbers labels as mentioned in a comment by Troy D below. Troy D's solution is the solution that I am now using.