I have two large datasets, `val dataA: Dataset[TypeA]` and `val dataB: Dataset[TypeB]`, where both `TypeA` and `TypeB` extend `Serializable`.
I want to join the two datasets on differently named columns, i.e. where `TypeA.ColumnA == TypeB.ColumnB`. Spark offers the `joinWith` function on a `Dataset`, which I think will do this properly, but the function is undocumented and marked as "experimental".
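To make the question concrete, here is a minimal sketch of what I mean, with hypothetical case classes and field names standing in for `TypeA`/`TypeB` (`ColumnA`, `ColumnB`, and the payload fields are placeholders):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical stand-ins for my actual types.
case class TypeA(ColumnA: String, payloadA: Int) extends Serializable
case class TypeB(ColumnB: String, payloadB: Double) extends Serializable

object JoinWithSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("joinWith-sketch")
      .getOrCreate()
    import spark.implicits._

    val dataA: Dataset[TypeA] = Seq(TypeA("k1", 1), TypeA("k2", 2)).toDS()
    val dataB: Dataset[TypeB] = Seq(TypeB("k1", 0.5)).toDS()

    // joinWith keeps both sides fully typed:
    // the result is a Dataset[(TypeA, TypeB)], not a flattened DataFrame.
    val joined: Dataset[(TypeA, TypeB)] =
      dataA.joinWith(dataB, dataA("ColumnA") === dataB("ColumnB"), "inner")

    joined.show()
    spark.stop()
  }
}
```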
The other approach I have looked at is to convert the datasets to PairRDDs and join them on a common key (as described in this Stack Overflow post: how to join two datasets by key in scala spark).
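The PairRDD variant I am considering would look roughly like this (again assuming the placeholder field names `ColumnA`/`ColumnB` from above):

```scala
// Drop to the RDD level, key each side by its join column, then join.
// keyBy produces RDD[(String, TypeA)] / RDD[(String, TypeB)],
// and join yields RDD[(String, (TypeA, TypeB))].
val pairA = dataA.rdd.keyBy(_.ColumnA)
val pairB = dataB.rdd.keyBy(_.ColumnB)

val joinedRdd = pairA.join(pairB).values // RDD[(TypeA, TypeB)]
```

This works, but it abandons the Dataset API (and its Catalyst optimizations), which is why I am unsure it is the right trade-off.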
Is there a better approach to joining two datasets, or is using either `joinWith` or PairRDDs the best way?