
In our Spark-Scala application, we want to use typed Datasets. There is a join between two DataFrames, DF1 and DF2.

My question is: should we convert both DF1 and DF2 to Dataset[T] and then perform the join, or should we do the join first and then convert the resulting DataFrame to a Dataset?

As I understand it, Dataset[T] is being used here for type safety, so we should convert DF1 and DF2 to Dataset[T] before the join. Can someone please confirm, and advise if something is not correct?
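For concreteness, here is a minimal sketch of both options. The case classes Customer, Order, and CustomerOrder are hypothetical stand-ins, since the question does not give the real schemas:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case classes standing in for the real schemas of DF1 and DF2.
case class Customer(custId: Long, name: String)
case class Order(orderId: Long, custId: Long, amount: Double)
case class CustomerOrder(custId: Long, name: String, orderId: Long, amount: Double)

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val df1 = Seq(Customer(1L, "Ann"), Customer(2L, "Bob")).toDF()
    val df2 = Seq(Order(10L, 1L, 42.0)).toDF()

    // Option A: convert both sides first, then use the typed joinWith.
    // The result is Dataset[(Customer, Order)], so both sides stay typed.
    val ds1: Dataset[Customer] = df1.as[Customer]
    val ds2: Dataset[Order]    = df2.as[Order]
    val typedJoin: Dataset[(Customer, Order)] =
      ds1.joinWith(ds2, ds1("custId") === ds2("custId"))

    // Option B: join the untyped DataFrames, then convert the flat result.
    // The join itself is unchecked; typing is recovered only at the end.
    val joinThenConvert: Dataset[CustomerOrder] =
      df1.join(df2, Seq("custId")).as[CustomerOrder]

    typedJoin.show()
    joinThenConvert.show()
    spark.stop()
  }
}
```

Note the shape difference: Option A yields a Dataset of pairs, while Option B yields a flat Dataset whose target case class must match the joined columns by name.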

  • Conceptually, consider DataFrame as an alias for a collection of generic objects, Dataset[Row], where a Row is a generic untyped JVM object. A Dataset, by contrast, is a collection of strongly typed JVM objects, dictated by a case class you define in Scala or a class in Java. – Rex Mar 20 '19 at 06:41
  • Maybe this related question helps: https://stackoverflow.com/questions/40397206/how-can-i-combineconcatenate-two-data-frames-with-the-same-column-name-in-java – Rex Mar 20 '19 at 06:43
  • No, that question doesn't help. My question is whether I should convert both DataFrames to Dataset[T] before the join, or convert the result of the join to Dataset[T]. – Don Sam Mar 20 '19 at 07:11
  • I see. Did you compare the processing time of both approaches? – Rex Mar 20 '19 at 07:28
  • I think both solutions are okay, but check whether there is a difference in processing time; if not, then either approach is fine. – Rex Mar 20 '19 at 07:29
  • Considering that typed joins [don't even type check](https://stackoverflow.com/q/40605167/10938362), I'd say at no point at all (see the sketch after these comments). – user10938362 Mar 20 '19 at 09:11
  • I have coded both, i.e. joins with typed Datasets and with DataFrames; I will test them and compare the performance. – Don Sam Mar 22 '19 at 05:17
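To make the type-checking point from the comments concrete: joinWith types the result rows, but the join condition is still an untyped Column expression, so a misspelled column name compiles and only fails at runtime. A hedged sketch, continuing the ds1/ds2 example above:

```scala
// Continuing the sketch above: joinWith types the result rows, but the
// condition is an untyped Column, so column names are not compile-checked.
val ok = ds1.joinWith(ds2, ds1("custId") === ds2("custId"))      // fine
// The next line compiles, but fails at runtime with an AnalysisException
// because "custIdd" does not exist:
// val bad = ds1.joinWith(ds2, ds1("custIdd") === ds2("custId"))
```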

0 Answers