
I have two large datasets,

`val dataA: Dataset[TypeA]` and `val dataB: Dataset[TypeB]`, where both `TypeA` and `TypeB` extend `Serializable`.

I want to join the two datasets on different columns, i.e. where `TypeA.ColumnA == TypeB.ColumnB`. Spark offers `joinWith` on a Dataset, which I think will do this properly, but the function is undocumented and marked as "experimental".
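
For concreteness, here is a minimal sketch of what I mean (the case class fields `columnA`/`columnB` and the sample data are placeholders for my real schema):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class TypeA(columnA: Long, payloadA: String) extends Serializable
case class TypeB(columnB: Long, payloadB: String) extends Serializable

val spark = SparkSession.builder().appName("joinWithSketch").getOrCreate()
import spark.implicits._

val dataA: Dataset[TypeA] = Seq(TypeA(1L, "a"), TypeA(2L, "b")).toDS()
val dataB: Dataset[TypeB] = Seq(TypeB(1L, "x"), TypeB(3L, "y")).toDS()

// joinWith keeps both sides typed: the result is Dataset[(TypeA, TypeB)]
val joined: Dataset[(TypeA, TypeB)] =
  dataA.joinWith(dataB, dataA("columnA") === dataB("columnB"), "inner")
```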

The other approach I have looked at is to use PairRDDs instead of Datasets and join them on a common key (as described in this Stack Overflow post: how to join two datasets by key in scala spark).
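
The PairRDD version would look roughly like this (continuing from the sketch above):

```scala
// Key each Dataset's underlying RDD by its join column, then use the
// classic PairRDD join; the result is RDD[(Long, (TypeA, TypeB))].
val pairA = dataA.rdd.map(a => (a.columnA, a))
val pairB = dataB.rdd.map(b => (b.columnB, b))
val joinedRdd = pairA.join(pairB)
```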

Is there a clearly better approach to joining two Datasets, or is using either `joinWith` or PairRDDs the best way?

sparkonhdfs
  • Take a look at this question: https://stackoverflow.com/questions/36462674/spark-dataset-api-join – Nanda Apr 16 '18 at 19:48
  • That doesn't answer the question I asked though. – sparkonhdfs Apr 16 '18 at 19:57
  • It depends on what you want the result type to be: `joinWith` returns a `Dataset[(TypeA, TypeB)]`, while the regular `join` returns a `Dataset[Row]` (aka `DataFrame`) with the columns from both sides of the join in a single flat structure, like a normal SQL join. – puhlen Apr 16 '18 at 20:55
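
A minimal sketch of the return-type difference described in that last comment, reusing the `dataA`/`dataB` definitions sketched in the question (column names are assumptions):

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

// joinWith keeps the two types separate in a tuple...
val typed: Dataset[(TypeA, TypeB)] =
  dataA.joinWith(dataB, dataA("columnA") === dataB("columnB"))

// ...while join flattens the columns of both sides into a single
// Dataset[Row] (a DataFrame), like a plain SQL join.
val flat: DataFrame =
  dataA.join(dataB, dataA("columnA") === dataB("columnB"))
```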
