
I have two large datasets,

`val dataA: Dataset[TypeA]` and `val dataB: Dataset[TypeB]`, where both `TypeA` and `TypeB` extend `Serializable`.

I want to join the two datasets on different columns, i.e. where `TypeA.ColumnA == TypeB.ColumnB`. Spark offers `joinWith` on a Dataset, which I think will do this properly, but the function is undocumented and marked as "experimental".
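
For concreteness, here is a minimal sketch of what I mean (the case class fields `columnA`/`columnB` and the sample data are placeholders for my real schema):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class TypeA(columnA: Long, payloadA: String) extends Serializable
case class TypeB(columnB: Long, payloadB: String) extends Serializable

val spark = SparkSession.builder().appName("joinWithSketch").getOrCreate()
import spark.implicits._

val dataA: Dataset[TypeA] = Seq(TypeA(1L, "a"), TypeA(2L, "b")).toDS()
val dataB: Dataset[TypeB] = Seq(TypeB(1L, "x"), TypeB(3L, "y")).toDS()

// joinWith keeps both sides typed: the result is Dataset[(TypeA, TypeB)]
val joined: Dataset[(TypeA, TypeB)] =
  dataA.joinWith(dataB, dataA("columnA") === dataB("columnB"), "inner")
```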

The other approach I have looked at is to use PairRDDs instead of Datasets and join them on a common key (as described in this Stack Overflow post: how to join two datasets by key in scala spark).
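
The PairRDD version would look roughly like this (continuing from the sketch above):

```scala
// Key each Dataset's underlying RDD by its join column, then use the
// classic PairRDD join; the result is RDD[(Long, (TypeA, TypeB))].
val pairA = dataA.rdd.map(a => (a.columnA, a))
val pairB = dataB.rdd.map(b => (b.columnB, b))
val joinedRdd = pairA.join(pairB)
```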

Is there a clearly better approach to joining two Datasets, or is using either `joinWith` or PairRDDs the best way?

sparkonhdfs
  • Take a look at this question: https://stackoverflow.com/questions/36462674/spark-dataset-api-join – Nanda Apr 16 '18 at 19:48
  • That doesn't answer the question I asked though. – sparkonhdfs Apr 16 '18 at 19:57
  • It depends on what you want the result type to be: `joinWith` returns a `Dataset[(TypeA, TypeB)]`, while the regular `join` returns a `Dataset[Row]` (aka `DataFrame`) with the columns from both sides of the join in a single flat structure, like a normal SQL join. – puhlen Apr 16 '18 at 20:55
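
A minimal sketch of the return-type difference described in that last comment, reusing the `dataA`/`dataB` definitions sketched in the question (column names are assumptions):

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

// joinWith keeps the two types separate in a tuple...
val typed: Dataset[(TypeA, TypeB)] =
  dataA.joinWith(dataB, dataA("columnA") === dataB("columnB"))

// ...while join flattens the columns of both sides into a single
// Dataset[Row] (a DataFrame), like a plain SQL join.
val flat: DataFrame =
  dataA.join(dataB, dataA("columnA") === dataB("columnB"))
```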
