
I have two datasets A and B with TypeA and TypeB respectively. I join the datasets on a column (let's call it "key") to get dataset C. After that, I need to discard the events in dataset A that were joined with B and retain only those in A that could not be joined. How do I go about it?

white-hawk-73
  • Join an arbitrary column with the key. Filter for that arbitrary column being null. – samkart Jan 18 '20 at 06:42
  • didn't get it. Can you explain? – white-hawk-73 Jan 18 '20 at 09:49
  • Seems a duplicate [Left Anti join in Spark?](https://stackoverflow.com/questions/43186888/left-anti-join-in-spark) – philipxy Jan 18 '20 at 23:56
  • This is a faq. Before considering posting please always google any error message or many clear, concise & precise phrasings of your question/problem/goal, with & without your particular strings/names & site:stackoverflow.com & tags, & read many answers. If you post a question, use one phrasing as title. See [ask] & the voting arrow mouseover texts. – philipxy Jan 18 '20 at 23:56

1 Answer


What you are looking for is a left-anti join. Check out this post for more details: [Left Anti join in Spark?](https://stackoverflow.com/questions/43186888/left-anti-join-in-spark)

Paul
  • Thanks for your answer. Is there a way I can avoid the join operation (an anti-join is also a kind of join, if I understood correctly)? I am already performing a join between A and B to get the resultant dataset; now I want to find the non-joined entries in A without performing another join with B. – white-hawk-73 Jan 18 '20 at 12:13
  • You actually only need one join: `c_filtered = a.join(b, some_condition, 'left_anti')`. You can skip the step of having a non-filtered C. – Paul Jan 18 '20 at 17:40
  • Please don't answer duplicates, (flag to) close. – philipxy Jan 18 '20 at 23:55