This might be a stupid question, but out of curiosity: can I do the following?

var rawDF = spark.read
        .option("mode", "FAILFAST")
        .option("inferSchema", "true")
        .option("header", "false")
        .csv(hdfs_file)

val trainRatio = 0.8
val testRatio = 0.2
val Array(trainDF, testDF) = rawDF.randomSplit(Array(trainRatio, testRatio))
var temp : Dataset[DataFrame] = spark.emptyDataset[DataFrame]
val sampleDataset = Seq(trainDF,testDF).toDS()
temp = temp.union(sampleDataset)

In IntelliJ, I am getting an error at the .toDS() line. If this were possible, we could simply apply a map operation on the Dataset to run arbitrary computations on the DataFrames inside it. Am I wrong to think this? Does the error have to do with no encoder being available for DataFrame?

user238607
  • Hi, please refer to this; we need to import implicits: https://stackoverflow.com/questions/44516627/how-to-convert-a-dataframe-to-dataset-in-apache-spark-in-scala/44516773 – Newbie_Bigdata Dec 05 '19 at 11:55
  • I have already imported implicits. So I guess it is not related to that. – user238607 Dec 05 '19 at 12:57

2 Answers

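No, it is not possible. A DataFrame (like any Dataset or RDD) is a driver-side handle to a distributed computation, not a plain value that can be stored as an element of another distributed collection. Spark provides no Encoder for DataFrame, which is why the .toDS() call is rejected, and a DataFrame cannot be used inside code that runs on the executors.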

Vladislav Varslavans

As Vladislav answered earlier, you cannot create a Dataset of DataFrames, but you can definitely create a List of DataFrames (or Datasets or RDDs) if it makes sense!

For instance, you can define a List of DataFrames, apply the same transformations to each one, and run the same actions on all of them concurrently.
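A minimal sketch of that idea, under some assumptions: it uses scala.concurrent.Future as one possible way to submit the actions concurrently, replaces the CSV input from the question with a generated spark.range DataFrame, and runs Spark in local mode. The names ListOfDataFrames and countConcurrently are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical helper object, for illustration only.
object ListOfDataFrames {

  // Submits one count() action per DataFrame from the driver, each in its own
  // Future, then blocks until all of them have finished.
  def countConcurrently(frames: List[DataFrame]): List[Long] = {
    val jobs: List[Future[Long]] = frames.map(df => Future(df.count()))
    Await.result(Future.sequence(jobs), 10.minutes)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("list-of-dataframes")
      .master("local[*]") // assumption: local mode, just for this sketch
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the CSV data read in the question.
    val rawDF = spark.range(1000).toDF("value")
    val Array(trainDF, testDF) = rawDF.randomSplit(Array(0.8, 0.2), 42L)

    // An ordinary driver-side List, not a Dataset[DataFrame].
    val frames: List[DataFrame] = List(trainDF, testDF)

    // The same transformation applied to every DataFrame in the list.
    val evens: List[DataFrame] = frames.map(_.filter($"value" % 2 === 0))

    // The same action run on each of them concurrently.
    println(countConcurrently(evens))

    spark.stop()
  }
}
```

The List lives entirely on the driver, so no Encoder is needed; only the jobs triggered by count() are distributed. The global execution context is used here for brevity; a dedicated thread pool may be a better fit when many jobs are submitted at once.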

baitmbarek
  • Do you mean using something like this: https://docs.scala-lang.org/overviews/parallel-collections/overview.html – user238607 Dec 06 '19 at 09:24
  • For the concurrent aspect, you can use a standard collection and wrap the work in a functional concurrency type like Task (https://monix.io/docs/3x/eval/task.html) or at least the impure but well-known Future :) (https://docs.scala-lang.org/overviews/core/futures.html) – baitmbarek Dec 06 '19 at 09:36