
I have two datasets A and B with TypeA and TypeB respectively. I join the datasets on a column (let's call it "key") to get dataset C. After that, I need to discard the events in dataset A that were joined with B and retain only those in A that could not be joined. How do I go about it?

white-hawk-73
  • Join an arbitrary column with the key. Filter for that arbitrary column being null. – samkart Jan 18 '20 at 06:42
  • didn't get it. Can you explain? – white-hawk-73 Jan 18 '20 at 09:49
  • Seems a duplicate [Left Anti join in Spark?](https://stackoverflow.com/questions/43186888/left-anti-join-in-spark) – philipxy Jan 18 '20 at 23:56
  • This is a faq. Before considering posting please always google any error message or many clear, concise & precise phrasings of your question/problem/goal, with & without your particular strings/names & site:stackoverflow.com & tags, & read many answers. If you post a question, use one phrasing as title. See [ask] & the voting arrow mouseover texts. – philipxy Jan 18 '20 at 23:56

1 Answer


What you are looking for is a left-anti join. Check out this post for more details: [Left Anti join in Spark?](https://stackoverflow.com/questions/43186888/left-anti-join-in-spark)

Paul
  • Thanks for your answer. Is there a way I can avoid the join operation (an anti-join is also a kind of join, if I understood correctly)? I am already performing a join between A and B to get the resultant dataset; now I want to find the non-joined entries in A without performing another join with B. – white-hawk-73 Jan 18 '20 at 12:13
  • You actually only need one join: `c_filtered = a.join(b, some_condition, 'left_anti')`. You can skip the step of having a non-filtered C. – Paul Jan 18 '20 at 17:40
  • Please don't answer duplicates, (flag to) close. – philipxy Jan 18 '20 at 23:55