Is it possible to create a Dataset of DataFrames in Spark?

Question

This might be a stupid question, but just out of curiosity: can I do the following?

var rawDF = spark.read
        .option("mode", "FAILFAST")
        .option("inferSchema", "true")
        .option("header", "false")
        .csv(hdfs_file)

val trainRatio = 0.8
val testRatio = 0.2
val Array(trainDF, testDF) = rawDF.randomSplit(Array(trainRatio, testRatio))
var temp : Dataset[DataFrame] = spark.emptyDataset[DataFrame]
val sampleDataset = Seq(trainDF,testDF).toDS()
temp = temp.union(sampleDataset)

In IntelliJ, I get an error at the .toDS() line. If this were possible, we could apply a map operation on the Dataset to run arbitrary computations on the DataFrames inside it. Am I wrong to think this? Is the error caused by there being no encoder for DataFrame?

Hi, please refer to this; we need to import implicits: https://stackoverflow.com/questions/44516627/how-to-convert-a-dataframe-to-dataset-in-apache-spark-in-scala/44516773 — Newbie_Bigdata, Dec 05 '19 at 11:55
I have already imported implicits. So I guess it is not related to that. — user238607, Dec 05 '19 at 12:57

score 3 · Accepted Answer · answered Dec 05 '19 at 11:44

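No, it is not possible. DataFrame and Dataset are both distributed collections, and they cannot be nested inside another Dataset: Spark provides no Encoder for them, which is why .toDS() fails on a Seq of DataFrames.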
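To illustrate the encoder point, here is a minimal sketch, assuming a local SparkSession. The `Rating` case class and the `EncoderSketch` object are hypothetical names used only for this example. It shows element types for which `spark.implicits._` does supply an Encoder, with the failing DataFrame case left as a comment because it does not compile.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object EncoderSketch {
  // Hypothetical record type, used only for this illustration.
  final case class Rating(userId: Int, score: Double)

  // Builds two Datasets whose element types have implicit Encoders.
  def build(spark: SparkSession): (Dataset[Int], Dataset[Rating]) = {
    import spark.implicits._

    // Works: Encoders exist for primitives and for case classes.
    val ints: Dataset[Int] = Seq(1, 2, 3).toDS()
    val ratings: Dataset[Rating] = Seq(Rating(1, 4.5), Rating(2, 3.0)).toDS()

    // Does not compile: there is no implicit Encoder[DataFrame],
    // so toDS() is not available on a Seq of DataFrames.
    // val nested = Seq(ints.toDF(), ratings.toDF()).toDS()

    (ints, ratings)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for demonstration only
      .appName("encoder-sketch")
      .getOrCreate()
    val (ints, ratings) = build(spark)
    println(s"ints: ${ints.count()}, ratings: ${ratings.count()}")
    spark.stop()
  }
}
```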

Vladislav Varslavans


score 1 · Answer 2 · answered Dec 05 '19 at 21:50

As Vladislav answered earlier, you cannot create a Dataset of DataFrames, but you can definitely create a List of DataFrames (or Datasets or RDDs) if that makes sense for your use case.

For instance, you can define a List of DataFrames, apply the same transformations to each, and run the same actions on them concurrently.
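One way this could look is sketched below, assuming a local SparkSession and Scala's standard Future with the global ExecutionContext. The names `DataFrameListSketch`, `rowCount`, and `countAll` are hypothetical, and `count()` stands in for whatever action you actually need. Each Future submits its own Spark job from the driver, which Spark supports because jobs may be submitted from multiple threads.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DataFrameListSketch {
  // Hypothetical stand-in for the computation applied to each DataFrame.
  def rowCount(df: DataFrame): Long = df.count()

  // Runs the same action on every DataFrame, one driver-side Future per job.
  def countAll(frames: Seq[DataFrame]): Seq[Long] = {
    val jobs: Seq[Future[Long]] = frames.map(df => Future(rowCount(df)))
    // Block until all jobs finish; the timeout is an arbitrary choice.
    Await.result(Future.sequence(jobs), 10.minutes)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for demonstration only
      .appName("df-list-sketch")
      .getOrCreate()
    import spark.implicits._

    // Small in-memory data instead of the CSV file from the question.
    val rawDF = (1 to 100).toDF("value")
    val Array(trainDF, testDF) = rawDF.randomSplit(Array(0.8, 0.2), seed = 42L)

    // A plain Scala Seq holds the DataFrames; no Encoder is needed.
    val counts = countAll(Seq(trainDF, testDF))
    println(s"train=${counts(0)}, test=${counts(1)}")
    spark.stop()
  }
}
```

The collection lives on the driver only, so mapping over it is ordinary Scala code rather than a distributed operation; the parallelism comes from the Spark jobs each Future triggers.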

baitmbarek


Do you mean using something like this: https://docs.scala-lang.org/overviews/parallel-collections/overview.html – user238607 Dec 06 '19 at 09:24

For the concurrent aspect, you can use a standard collection and wrap the work in a functional concurrent type like Task (https://monix.io/docs/3x/eval/task.html), or at least the impure but famous Future :) (https://docs.scala-lang.org/overviews/core/futures.html) – baitmbarek Dec 06 '19 at 09:36
