anti join pandas data frames at different levels in python

Question

I am having two pandas data frames say df1 and df2. df1 has 6 variables and df2 has 5 variables. and first variable in both the data frames are in string format and reaming are in int format.

i want to identify the mismatched records in both data frames by using first 3 columns of both data frames and have to exclude them from df1 dataframe.

for that i tried the following code but it throws Nan values for me, if i drop the Nan values then required data will be deleted.

input data:-

**df1:-**                 **df2:-**  

x1   x2 x3 x4 x5 x6      x1  x2 x3 x4 x5
SM   1  1  2  3  3       RK  2  4  3  4  
RK   2  2  3  4  5       SM  1  1  3  3  
NBR  1  2  2  5  6       NB  1  2  3  2
CBK  2  5  6  7  8       VSB 5  6  3  2  
VSB  5  6  4  2  1       CB  2  6  4  1
SB   6  2  3  2  1       SB  6  2  4  1

expected_out_put:-

x1  x2 x3 x4 x5 x6
RK  2  2  3  4  5
CBK 2  5  6  7  8
NBR 1  2  2  5  6

syntax:-

data_out=df1[~df1['x1','x2','x3'].isin(df2['x1','x2','x3'])]
data_out=data_out.dropna()

please anyone can help me to tackle this.

Thanks in advance.

Performing an inner merge will keep only the records in `df1` that match with a record in `df2`: `df1.merge(df2, on=['col1', 'col2', 'col3'], how='inner')`. This will exclude anything mismatched, and also anything in `df1` that is not in `df2`. Not sure if that last condition counts as a 'mismatch` — ALollz, Aug 27 '18 at 15:43
It would be helpful if you provided some sample data, as well as an example of the desired output. Currently, it is not very clear what you are asking. — Andrew, Aug 27 '18 at 15:49
if your result df has na in all the rows , dropna will drop all rows which has na anywhere in any column, dropna(how='all') should be used where you want to drop the rows where all values are na in every column. However, if you post a sample df and the expected df , it will be helpful. — anky, Aug 27 '18 at 16:05

jezrael · Answer 1 · 2018-08-28T11:44:52.733

1

Use merge with left join first, get columns names of added columns from df2 and filter out all non NaNs rows by them:

df = df1.merge(df2, on=['x1', 'x2', 'x3'], how='left', suffixes=('','_'))
print (df)
    x1  x2  x3  x4  x5  x6  x4_  x5_
0   SM   1   1   2   3   3  3.0  3.0
1   RK   2   2   3   4   5  NaN  NaN
2   NB   1   2   2   5   6  3.0  2.0
3   CB   2   5   6   7   8  NaN  NaN
4  VSB   5   6   4   2   1  3.0  2.0
5   SB   6   2   3   2   1  4.0  1.0

cols = df.columns.difference(df1.columns)
print (cols)
Index(['x4_', 'x5_'], dtype='object')

df = df.loc[df[cols].isnull().all(axis=1), df1.columns.tolist()]
print (df)
   x1  x2  x3  x4  x5  x6
1  RK   2   2   3   4   5
3  CB   2   5   6   7   8

EDIT:

With your sample data I get:

df = df1.merge(df2, on=['x1', 'x2', 'x3'], how='left', suffixes=('','_'))
print (df)
    x1  x2  x3  x4  x5  x6  x4_  x5_
0   SM   1   1   2   3   3  3.0  3.0
1   RK   2   2   3   4   5  NaN  NaN
2  NBR   1   2   2   5   6  NaN  NaN
3  CBK   2   5   6   7   8  NaN  NaN
4  VSB   5   6   4   2   1  3.0  2.0
5   SB   6   2   3   2   1  4.0  1.0

cols = df.columns.difference(df1.columns)
print (cols)
Index(['x4_', 'x5_'], dtype='object')

df = df.loc[df[cols].isnull().all(axis=1), df1.columns.tolist()].x1.tolist()
print (df)
    x1  x2  x3  x4  x5  x6
1   RK   2   2   3   4   5
2  NBR   1   2   2   5   6
3  CBK   2   5   6   7   8

edited Aug 28 '18 at 11:44

answered Aug 28 '18 at 07:32

jezrael

822,522
95
1,334
1,252

it is giving expected results if both data frame records are in same order. if we change the order of the records in a data frame it won't give the expected result.can you check it once? – Neeraja Bandreddi Aug 28 '18 at 11:14
@neeraja - Do you think swap `SM` and `CB` rows? I think it should working nice. – jezrael Aug 28 '18 at 11:16
only in df2 not in df1. – Neeraja Bandreddi Aug 28 '18 at 11:22
@neeraja - Can you edit question with sample which not working? Tjhanks. – jezrael Aug 28 '18 at 11:23
@neeraja - Why `NBR` and `CBK` are swapped? Ordering is `['RK', 'NBR', 'CBK']` in sample data in `df1`, or not? – jezrael Aug 28 '18 at 11:41
output ordering is not an issue, but i want expected records in output data frame. – Neeraja Bandreddi Aug 28 '18 at 11:43
@neeraja - Please check edited answer, for me it working nice. – jezrael Aug 28 '18 at 11:45
1

Okay I will check it one more time.Thanks. – Neeraja Bandreddi Aug 28 '18 at 11:56
i tried the following code, it is working as expected. by=['x1','x2','x3'] out=df1[~df1[by].apply(tuple,1).isin(df2[by].apply(tuple,1))].reset_index() – Neeraja Bandreddi Aug 28 '18 at 12:39
@neeraja - So `out=df1[~df1[by].apply(tuple,1).isin(df2[by].apply(tuple,1))].reset_index()` working and my code not? – jezrael Aug 28 '18 at 12:42
1

it is also working fine, but in different situations that is not giving me the desired output.Thanks. – Neeraja Bandreddi Aug 28 '18 at 12:59

anti join pandas data frames at different levels in python

1 Answers1