
I have two dataframes that each have hundreds of columns.

#df1 = 190 columns
#df2 = 262 columns

#subset dataframes
import pandas as pd

df1 = pd.DataFrame({'Key': ['10003', '10009', '10010', '10034', '10665'],
                    'Num1': [12, 13, 13, 13, 13],
                    'Color': ['red', 'orange', 'red', 'red', 'red'],
                    'Date1': [20120506, 20120506, 20120506, 20120506, 20120620],
                    'Date2': [20120528, 20120507, 20120615, 20120629, 20120621]})
df2 = pd.DataFrame({'Key': ['10003', '10009', '10010', '10011', '10012', '10034', '10034', '10034'],
                    'Num1': [12, 13, 13, 13, 13, 13, 14, 14],
                    'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
                    'Date1': [20120506, 20120506, 20120506, 20120506, 20120620, 20120506, 20120206, 20120405],
                    'Date2': [20120528, 20120507, 20120615, 20120629, 20120621, 20120629, 20120506, 20120506]})

I want to remove all rows in df2 that are also in df1, but leave df1 unchanged.

I am very close when using pd.concat() or merge(), but the problem is that I create a bunch of unnecessary columns [with both concat() and merge()], and rows found only in df1 get added to df2 [with concat()]. Realistically, when 'Key', 'Num1', 'Date1', and 'Date2' are common among rows, those rows can be treated as duplicates.

The attempt below is close, but it adds extra columns from df1 and keeps all extra rows from df1. I don't want any additional columns or rows from df1; I just want to remove the rows in df2 that exist in both dfs. df1 should remain unchanged.

a = pd.concat([df2, df1])
a.drop_duplicates(subset=['Key', 'Num1', 'Date1', 'Date2'], keep=False, inplace=True)

    Color   Date1       Date2       Key     Num1    Num2
3   NaN     20120506    20120629    10011   13      124.0
4   NaN     20120620    20120621    10012   13      125.0
6   NaN     20120206    20120506    10034   14      127.0
7   NaN     20120405    20120506    10034   14      128.0
4   red     20120620    20120621    10665   13      NaN

I also tried this with merge, with advice from here, but I still get the extra columns.

df_all = df2.merge(df1.drop_duplicates(), on=['Key', 'Num1','Date1','Date2'], 
               how='left', indicator=True)
df_all[df_all['_merge'] == 'left_only']

Result from the merge:

     Date1      Date2       Key     Num1    Num2    Color   _merge
3    20120506   20120629    10011   13      124     NaN     left_only
4    20120620   20120621    10012   13      125     NaN     left_only
6    20120206   20120506    10034   14      127     NaN     left_only
7    20120405   20120506    10034   14      128     NaN     left_only

The merge gets me the correct rows but additional columns. Normally this wouldn't be an issue and I could just .drop() them, but there are 100+ extra columns after the merge.
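For completeness, the extra columns could be sliced away after the merge (a sketch; it still builds the wide merged frame first, which is the part I want to avoid):

# keep only the unmatched rows, restricted to df2's original columns
result = df_all.loc[df_all['_merge'] == 'left_only', df2.columns]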

How can I remove the duplicates without adding extra columns or extra rows from df1?

Final expected result:

    Date1       Date2       Key     Num1    Num2
3   20120506    20120629    10011   13      124.0
4   20120620    20120621    10012   13      125.0
6   20120206    20120506    10034   14      127.0
7   20120405    20120506    10034   14      128.0
CandleWax

3 Answers


Use the on keys to subset df1's columns, so the merge brings in nothing extra:

In [514]: on = ['Date1', 'Date2', 'Num1', 'Key']

In [515]: (df2.merge(df1[on], on=on, how='left', indicator=True)
              .query('_merge == "left_only"').drop('_merge', 1))
Out[515]:
      Date1     Date2    Key  Num1  Num2
3  20120506  20120629  10011    13   124
4  20120620  20120621  10012    13   125
6  20120206  20120506  10034    14   127
7  20120405  20120506  10034    14   128
Zero
  • 74,117
  • 18
  • 147
  • 154
  • 3
As of Pandas v1.3.3, `.drop('_merge', 1)` emits a deprecation warning. Use the keyword argument instead: `.drop(columns='_merge')`. – thin Sep 19 '21 at 13:12
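Following that comment, the same chain with the keyword form (a sketch, reusing the on list from the answer above):

on = ['Date1', 'Date2', 'Num1', 'Key']

# tag each df2 row with whether it matched df1's key columns,
# keep the unmatched rows, then drop the indicator column
out = (df2.merge(df1[on], on=on, how='left', indicator=True)
          .query('_merge == "left_only"')
          .drop(columns='_merge'))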

You can use concat with keys:

s = (pd.concat([df1, df2], keys=[1, 2])
       .drop_duplicates(['Key', 'Num1', 'Date1', 'Date2'], keep='first'))

df1 = s.loc[1].dropna(axis=1)
df1
Out[1260]: 
    Color     Date1     Date2    Key  Num1
0     red  20120506  20120528  10003    12
1  orange  20120506  20120507  10009    13
2     red  20120506  20120615  10010    13
3     red  20120506  20120629  10034    13
4     red  20120620  20120621  10665    13
df2 = s.loc[2].dropna(axis=1)
df2
Out[1262]: 
      Date1     Date2    Key  Num1   Num2
3  20120506  20120629  10011    13  124.0
4  20120620  20120621  10012    13  125.0
6  20120206  20120506  10034    14  127.0
7  20120405  20120506  10034    14  128.0
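A sketch of a more compact variant of the same idea: pass only the key columns of df1 into the concat, so df2's surviving rows come back without any NaN-filled columns (column names taken from the question).

on = ['Key', 'Num1', 'Date1', 'Date2']

# df1's key columns go in first and are kept by keep='first'; their
# duplicates in df2 are dropped, and .loc[2] selects the rows from df2
df2_clean = (pd.concat([df1[on], df2], keys=[1, 2])
               .drop_duplicates(on, keep='first')
               .loc[2])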
BENY

This is one way: concat in the opposite order (df1 on top), drop duplicates, then explicitly drop the unwanted df1 rows and columns.

df = pd.concat([df1, df2], ignore_index=True)\
       .drop_duplicates(subset=['Date1', 'Date2', 'Key', 'Num1'], keep=False)

# remove the surviving df1 rows (labels 0-4 after the index reset),
# then the columns only df1 has
df = df.drop(index=df1.index.intersection(df.index))\
       .drop(columns=set(df1.columns) - set(df2.columns))

#        Date1     Date2    Key  Num1   Num2
# 8   20120506  20120629  10011    13  124.0
# 9   20120620  20120621  10012    13  125.0
# 11  20120206  20120506  10034    14  127.0
# 12  20120405  20120506  10034    14  128.0

This is very similar to @Wen's solution, the main difference being that the dropping of unwanted df1 data is explicit. Note, however, that this method resets the index.

jpp