I have two DataFrames, A and B:
A has columns (id, info1, info2) with about 200 million rows.
B only has the column id, with 1 million rows.
The id column is unique in both DataFrames.
I want a new DataFrame which filters A to only include the rows whose id appears in B.
If B were very small, I know I would do something along the lines of
A.filter($("id") isin B("id"))
but B is still pretty large, so not all of it can fit as a broadcast variable.
I also know I could use
A.join(B, Seq("id"))
but that wouldn't harness the uniqueness of id, and I'm afraid it will cause unnecessary shuffles.
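To make the join variant concrete, here is a minimal sketch of what I have in mind. It is not a benchmark or a recommendation. The tiny in-memory data, the local master, and the object name FilterAByB are assumptions for illustration only. The "left_semi" join type and the broadcast hint are not in my snippets above; I include them only because they keep just A's columns and let me compare physical plans with explain(). Whether B really fits as a broadcast depends on executor memory and on spark.sql.autoBroadcastJoinThreshold.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object FilterAByB {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("filter-a-by-b")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Tiny stand-ins for the real A (~200M rows) and B (~1M rows).
    val A = Seq((1L, "a1", "b1"), (2L, "a2", "b2"), (3L, "a3", "b3"))
      .toDF("id", "info1", "info2")
    val B = Seq(1L, 3L).toDF("id")

    // Variant 1: semi join on id. Only A's columns are returned, and each
    // A row is kept at most once if its id exists in B.
    val viaSemiJoin = A.join(B, Seq("id"), "left_semi")

    // Variant 2: the same semi join, but hinting Spark to broadcast B
    // instead of shuffling both sides. This only helps if B fits in memory.
    val viaBroadcast = A.join(broadcast(B), Seq("id"), "left_semi")

    // Print the physical plans to see which join strategy Spark picked
    // and whether an Exchange (shuffle) appears for A.
    viaSemiJoin.explain()
    viaBroadcast.explain()

    viaSemiJoin.show()
    spark.stop()
  }
}
```

Comparing the two explain() outputs on the real data should show whether A is being shuffled, which is the cost I am trying to avoid.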
What is the optimal method to achieve that task?