I have two datasets, A and B, of TypeA and TypeB respectively. I join the datasets on a column (let's call it "key") to get dataset C. After that, I need to discard the events in dataset A that were joined with B and retain only those in A that could not be joined. How do I go about it?
- Join an arbitrary column with the key, then filter for that arbitrary column being null (see the sketch after these comments). – samkart Jan 18 '20 at 06:42
- I didn't get it. Can you explain? – white-hawk-73 Jan 18 '20 at 09:49
- Seems a duplicate of [Left Anti join in Spark?](https://stackoverflow.com/questions/43186888/left-anti-join-in-spark) – philipxy Jan 18 '20 at 23:56
- This is a FAQ. Before posting, please google any error message or several clear, concise & precise phrasings of your question/problem/goal, with & without your particular strings/names & site:stackoverflow.com & tags, and read many answers. If you post a question, use one phrasing as the title. See [ask] & the voting arrow mouseover texts. – philipxy Jan 18 '20 at 23:56
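For illustration, here is a minimal PySpark sketch of the left-join-then-filter approach suggested in the first comment; the DataFrames and column names (`a`, `b`, `key`, `a_val`, `b_val`) are hypothetical stand-ins for the question's datasets:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: A has keys 1-3, B only has keys 1 and 2.
a = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], ["key", "a_val"])
b = spark.createDataFrame([(1, "b1"), (2, "b2")], ["key", "b_val"])

# A left join keeps every row of A; rows with no match in B get nulls in B's columns.
joined = a.join(b, on="key", how="left")

# Rows where B's column is null are exactly the rows of A that could not be joined.
unmatched_a = joined.filter(F.col("b_val").isNull()).select(a.columns)

unmatched_a.show()  # only key 3 remains
```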
1 Answer
What you are looking for is a left anti join. Check out this post for more details: [Left Anti join in Spark?](https://stackoverflow.com/questions/43186888/left-anti-join-in-spark)
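For example, a brief sketch in PySpark, reusing the hypothetical `a` and `b` DataFrames from the sketch above (the join column `key` is assumed):

```python
# A left anti join returns only the rows of A whose key has no match in B;
# no columns from B appear in the result.
unmatched_a = a.join(b, on="key", how="left_anti")

unmatched_a.show()  # again, only key 3 remains
```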

Paul
- Thanks for your answer. Is there a way I can avoid the join operation (an anti-join is also a kind of join, if I understood it correctly)? I am already performing a join between A and B to get the resultant dataset; now I want to see if there is a way to find the non-joined entries in A without performing another join with B. – white-hawk-73 Jan 18 '20 at 12:13
- You actually only need one join: `c_filtered = a.join(b, some_condition, 'left_anti')`. You can skip the part of having a non-filtered c. – Paul Jan 18 '20 at 17:40
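Putting the thread together, a sketch of the full workflow under the same hypothetical `a`/`b` setup as above: the inner join still produces C, and one additional left anti join yields the unmatched rows of A directly, so there is no need to build an unfiltered C and filter it afterwards.

```python
# C: rows of A that matched B (the join the question already performs).
c = a.join(b, on="key", how="inner")

# Unmatched rows of A: a single left anti join; no post-filtering of C is needed.
unmatched_a = a.join(b, on="key", how="left_anti")
```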