
In our Spark-Scala application, we want to use typed Datasets. There is a join between two DataFrames, DF1 and DF2.

My question is: should we convert both DF1 and DF2 to Dataset[T] and then perform the join, or should we do the join first and then convert the resulting DataFrame to a Dataset?

As I understand it, Dataset[T] is being used here for type safety, so we should convert DF1 and DF2 to Dataset[T] before the join. Can someone please confirm, and advise if something is not correct?
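For concreteness, here is a minimal sketch of both options. The case classes Customer, Order, and CustomerOrder are hypothetical stand-ins, since the question does not give the real schemas:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case classes standing in for the real schemas of DF1 and DF2.
case class Customer(custId: Long, name: String)
case class Order(orderId: Long, custId: Long, amount: Double)
case class CustomerOrder(custId: Long, name: String, orderId: Long, amount: Double)

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val df1 = Seq(Customer(1L, "Ann"), Customer(2L, "Bob")).toDF()
    val df2 = Seq(Order(10L, 1L, 42.0)).toDF()

    // Option A: convert both sides first, then use the typed joinWith.
    // The result is Dataset[(Customer, Order)], so both sides stay typed.
    val ds1: Dataset[Customer] = df1.as[Customer]
    val ds2: Dataset[Order]    = df2.as[Order]
    val typedJoin: Dataset[(Customer, Order)] =
      ds1.joinWith(ds2, ds1("custId") === ds2("custId"))

    // Option B: join the untyped DataFrames, then convert the flat result.
    // The join itself is unchecked; typing is recovered only at the end.
    val joinThenConvert: Dataset[CustomerOrder] =
      df1.join(df2, Seq("custId")).as[CustomerOrder]

    typedJoin.show()
    joinThenConvert.show()
    spark.stop()
  }
}
```

Note the shape difference: Option A yields a Dataset of pairs, while Option B yields a flat Dataset whose target case class must match the joined columns by name.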

  • Conceptually, consider DataFrame as an alias for a collection of generic objects, Dataset[Row], where a Row is a generic untyped JVM object. A Dataset, by contrast, is a collection of strongly typed JVM objects, dictated by a case class you define in Scala or a class in Java. – Rex Mar 20 '19 at 06:41
  • Maybe this related question helps: https://stackoverflow.com/questions/40397206/how-can-i-combineconcatenate-two-data-frames-with-the-same-column-name-in-java – Rex Mar 20 '19 at 06:43
  • No, that question doesn't help. My question is whether I should convert both DataFrames to Dataset[T] before the join, or convert the result of the join to Dataset[T]. – Don Sam Mar 20 '19 at 07:11
  • I see. Did you compare the processing time of both approaches? – Rex Mar 20 '19 at 07:28
  • I think both solutions are okay, but check whether there is a difference in processing time; if not, then either approach is fine. – Rex Mar 20 '19 at 07:29
  • Considering that typed joins [don't even type check](https://stackoverflow.com/q/40605167/10938362), I'd say at no point at all (see the sketch after these comments). – user10938362 Mar 20 '19 at 09:11
  • I have coded both, i.e. joins with typed Datasets and with DataFrames; I will test them and compare the performance. – Don Sam Mar 22 '19 at 05:17
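To make the type-checking point from the comments concrete: joinWith types the result rows, but the join condition is still an untyped Column expression, so a misspelled column name compiles and only fails at runtime. A hedged sketch, continuing the ds1/ds2 example above:

```scala
// Continuing the sketch above: joinWith types the result rows, but the
// condition is an untyped Column, so column names are not compile-checked.
val ok = ds1.joinWith(ds2, ds1("custId") === ds2("custId"))      // fine
// The next line compiles, but fails at runtime with an AnalysisException
// because "custIdd" does not exist:
// val bad = ds1.joinWith(ds2, ds1("custIdd") === ds2("custId"))
```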

0 Answers