I have two dataframes that each have hundreds of columns.
#df1 = 190 columns
#df2 = 262 columns
#subset dataframe
df1 = pd.DataFrame({'Key': ['10003', '10009', '10010', '10034', '10665'],
'Num1': [12,13,13,13,13],
'Color': ['red','orange','red','red','red'],
'Date1': [20120506, 20120506, 20120506,20120506,20120620],
'Date2': [20120528, 20120507, 20120615,20120629,20120621]})
df2 = pd.DataFrame({'Key': ['10003', '10009', '10010', '10011', '10012','10034','10034', '10034'],
'Num1': [12,13,13,13,13,13,14,14],
'Num2': [121,122,122,124,125,126,127,128],
'Date1': [20120506, 20120506, 20120506,20120506,20120620,20120506,20120206,20120405],
'Date2': [20120528, 20120507, 20120615,20120629,20120621,20120629,20120506,20120506]})
I want to remove all rows in df2 that are also in df1, but leave df1 unchanged.
I am very close when using pd.concat() or merge(), but the problem is that I am creating a bunch of unnecessary columns [with concat() and merge()] and rows only found in df1 are added to df2 [with concat()]. Realistically, when 'Key','Date1','Num','and 'Date2' are common among rows, then they could be treated as duplicates.
The attempt below is close, but it adds extra columns from df1 and keeps all extra rows from df1. I don't not want any additional column or rows from df1, just to remove any duplicates in df2 that exist in both dfs. df1 will remain unchanged.
a = pd.concat([df2,df1])
a.drop_duplicates(subset=['Key', 'Num1','Date1','Date2'],keep=False, inplace=True)
Color Date1 Date2 Key Num1 Num2
3 NaN 20120506 20120629 10011 13 124.0
4 NaN 20120620 20120621 10012 13 125.0
6 NaN 20120206 20120506 10034 14 127.0
7 NaN 20120405 20120506 10034 14 128.0
4 red 20120620 20120621 10665 13 NaN
I also tried this with merge, with advice from here, but I still get duplicate columns.
df_all = df2.merge(df1.drop_duplicates(), on=['Key', 'Num1','Date1','Date2'],
how='left', indicator=True)
df_all[df_all['_merge'] == 'left_only']
Result from merge,
Date1 Date2 Key Num1 Num2 Color _merge
3 20120506 20120629 10011 13 124 NaN left_only
4 20120620 20120621 10012 13 125 NaN left_only
6 20120206 20120506 10034 14 127 NaN left_only
7 20120405 20120506 10034 14 128 NaN left_only
The merge gets me the correct rows but additional columns. Normally this wouldnt be an issue and I could just .drop() but there are 100+ extra columns after the merge.
How can I remove duplicates without having to increase columns or add additional rows from df1.
Final expected result:
Date1 Date2 Key Num1 Num2
3 20120506 20120629 10011 13 124.0
4 20120620 20120621 10012 13 125.0
6 20120206 20120506 10034 14 127.0
7 20120405 20120506 10034 14 128.0