get distinct columns dataframe based on 2 ids

Question

Hello how can I do to only the lines where val is different in the 2 dataframes.

The way I need to filter is the following:

For each row of F1 (take each id1 if it is not null search for id1 F2 ) compare the VAL and if its different return it. else look at id2 and do the same thing.

Notice that I can have id1 or id2 or both as shown below:

d2 = {'id1': ['X22', 'X13',np.nan,'X02','X14'],'id2': ['Y1','Y2','Y3','Y4',np.nan],'VAL1':[1,0,2,3,0]}
F1 = pd.DataFrame(data=d2)
d2 = {'id1': ['X02', 'X13',np.nan,'X22','X14'],'id2': ['Y4','Y2','Y3','Y1','Y22'],'VAL2':[1,0,4,3,1]}
F2 = pd.DataFrame(data=d2)

Where F1 is:

   id1  id2  VAL1
0  X22   Y1     1
1  X13   Y2     0
2  NaN   Y3     2
3  X02   Y4     3
4  X14  NaN     0

and F2 is:

   id1  id2  VAL2
0  X02   Y4     1
1  X13   Y2     0
2  NaN   Y3     4
3  X22   Y1     3
4  X14  Y22     1

Expected output:

d2 = {'id1': ['X02',np.nan,'X22','X14'],'id2': ['Y4','Y3','Y1',np.nan],'VAL1':[3,2,1,0],'VAL2':[1,4,3,1]}
F3 = pd.DataFrame(data=d2)

   id1  id2  VAL1  VAL2
0  X02   Y4     3     1
1  NaN   Y3     2     4
2  X22   Y1     1     3
3  X14  NaN     0     1

i can get the value via id1 or id2 . No prob if one of them is null — Miss, Jun 07 '19 at 12:39
do you need `F1.merge(F2,on=['id1','id2'],how='right').query('VAL1!=VAL2')` ? — anky, Jun 07 '19 at 13:08
@SergeBallesta bace val1 == val2 however the comment by `anky_91` is correct except you have an extra record with the X14,Y22 row included hence my questions in the comments in my answer — B.C, Jun 07 '19 at 13:11
@SergeBallesta Yes, "only the lines where val is different" in the question :) — B.C, Jun 07 '19 at 13:16

score 1 · Answer 1 · answered Jun 07 '19 at 13:58

Ok it is a rather complex merge, because you want to merge on 2 columns, and any of them can contain NaN which should match anything (but not both).

I would to 2 separate merges:

first where id1 is not NaN in F1 on id1
second where id1 is NaN in F1 on id2

In both resultant dataframe, I would only keep rows where:

VAL1 != VAL2
AND (F1.id2 == F2.id2 or F1.id2 is NaN or F2.id2 is NaN)

Then I would concat them. Code could be:

t = F1.loc[~F1['id1'].isna()].merge(F2, on=['id1']).query('VAL1!=VAL2')
t = t[(t.id2_x==t.id2_y)|t.id2_x.isna()|t.id2_y.isna()]

t2 = F1.loc[F1['id1'].isna()].merge(F2, on=['id2']).query('VAL1!=VAL2')
t2 = t2[(t2.id1_x==t2.id1_y)|t2.id1_x.isna()|t2.id1_y.isna()]

# build back lost columns
t['id2'] = np.where(t['id2_x'].isna(), t['id2_y'], t['id2_x'])
t2['id1'] = np.where(t2['id1_x'].isna(), t2['id1_y'], t2['id1_x'])

# concat and reorder the columns
resul = pd.concat([t.drop(columns=['id2_x', 'id2_y']),
           t2.drop(columns=['id1_x', 'id1_y'])],
          ignore_index=True, sort=True).reindex(columns=
                                        ['id1', 'id2', 'VAL1', 'VAL2'])

Result is:

   id1  id2  VAL1  VAL2
0  X22   Y1     1     3
1  X02   Y4     3     1
2  X14  Y22     0     1
3  NaN   Y3     2     4

get distinct columns dataframe based on 2 ids

1 Answers1