df1 :
+--------------------+--------------+----------------------+
|event_id |action_id |cancellation_action_id|
+--------------------+--------------+----------------------+
|a |actionIdUnique|null |
|c |ActionId002 |null |
+--------------------+--------------+----------------------+
df2 :
+--------------------+--------------+----------------------+
| event_id| action_id|cancellation_action_id|
+--------------------+--------------+----------------------+
|a |actionIdUnique| null|
|b | ActionId004| ActionId002|
|c | ActionId002| null|
+--------------------+--------------+----------------------+
df1 is basicly df2 where cancellation_action_id is null.
my goal is to keep row from df1 where value in action_id is not present in cancellation_action_id
desired output :
+--------------------+--------------+----------------------+
|event_id |action_id |cancellation_action_id|
+--------------------+--------------+----------------------+
|a |actionIdUnique|null |
+--------------------+--------------+----------------------+
even_id c is remove because his action_id (ActionId002) is equal to cancellation_action_id from event_id b.
If feel that there are 2 way to do this : using join or window function.
I try to use left anti join but i dont understand why my jointed dataframe is not equal to the expected one.
df3 = df1.join(df2, df1("action_id) === df2("cancellation_action_id") , "leftanti")
my result :
+--------------------+--------------+----------------------+
|event_id |action_id |cancellation_action_id|
+--------------------+--------------+----------------------+
|a |actionIdUnique|null |
|c |ActionId002 |null |
+--------------------+--------------+----------------------+
I dont understand why the last row is not removed.
both DataFrame come from same DataFrame so they have the same schema.