15

Is there a way to get a single DataFrame by unioning DataFrames inside a loop?

This is some sample code:

var fruits = List(
  "apple"
  ,"orange"
  ,"melon"
) 

for (x <- fruits) {
  // df only exists inside this iteration; each pass creates a new single-row DataFrame
  var df = Seq(("aaa", "bbb", x)).toDF("aCol", "bCol", "name")
}

I would like to obtain something like this:

aCol | bCol | fruitsName
aaa  | bbb  | apple
aaa  | bbb  | orange
aaa  | bbb  | melon

Thanks again

Shaido
J.soo

6 Answers

26

You could create a sequence of DataFrames and then use reduce:

val results = fruits.
  map(fruit => Seq(("aaa", "bbb", fruit)).toDF("aCol","bCol","name")).
  reduce(_.union(_))

results.show()
Ramon
21

Steffen Schmitz's answer is the most concise one, I believe. Below is a more detailed answer if you are looking for more customization (of field types, etc.):

import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row

//initialize DF
val schema = StructType(
  StructField("aCol", StringType, true) ::
  StructField("bCol", StringType, true) ::
  StructField("name", StringType, true) :: Nil)
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)

//list to iterate through
var fruits = List(
    "apple"
    ,"orange"
    ,"melon"
)

for (x <- fruits) {
  //union returns a new dataset
  initialDF = initialDF.union(Seq(("aaa", "bbb", x)).toDF)
}

//initialDF.show()

cdncat
15

If you have multiple different DataFrames, you can use the code below, which is efficient.

val newDFs = Seq(DF1, DF2, DF3) // DF1, DF2 and DF3 are existing DataFrames with matching schemas
newDFs.reduce(_ union _)
Arun Goudar
  • How can I keep adding new DataFrames to the Seq using a loop? I would like to do a union at the end, but the DataFrames in my Seq are to be added using a loop. Is it doable? – Regressor Jul 02 '19 at 06:55
  • Why is this efficient? If you are applying a reduce function to a Scala Seq you are not making use of cluster parallelism and no distributed computing at all, right? – Borja_042 Sep 18 '19 at 11:38
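
A minimal sketch, not part of the answer above, of what the first comment asks for: collecting DataFrames inside a loop and unioning them once at the end (assuming a spark-shell session, where spark.implicits._ is already in scope):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.DataFrame

// hypothetical example: build one single-row DataFrame per loop iteration
val buffer = ArrayBuffer.empty[DataFrame]
for (fruit <- List("apple", "orange", "melon")) {
  buffer += Seq(("aaa", "bbb", fruit)).toDF("aCol", "bCol", "name")
}

// union everything once, after the loop has finished
val combined = buffer.reduce(_ union _)
combined.show()
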
7

In a for loop:

val fruits = List("apple", "orange", "melon")

( for(f <- fruits) yield ("aaa", "bbb", f) ).toDF("aCol", "bCol", "name")
Steffen Schmitz
1

Well... I think your question is a bit misguided.

As per my limited understanding of what you are trying to do, you should be doing the following:

val fruits = List(
  "apple",
  "orange",
  "melon"
)

val df = fruits
  .map(x => ("aaa", "bbb", x))
  .toDF("aCol", "bCol", "name")

And this should be sufficient.

sarveshseri
  • Thanks Sarvesh, but I need to get the union DataFrame in a loop, because there are various operations such as join and withColumn in the loop. I will get the DataFrame from hiveSql in the loop. – J.soo Apr 19 '17 at 08:54
  • "union data-frame in loop"... well, just this one statement leaves me unable to answer this question. Why do you need this "union data-frame in loop"? Can you elaborate in your question with more details about the "various operations such as join, withColumn in a loop"? – sarveshseri Apr 19 '17 at 09:42
1

You can first create a sequence and then use toDF to create a DataFrame.

scala> var dseq : Seq[(String,String,String)] = Seq[(String,String,String)]()
dseq: Seq[(String, String, String)] = List()

scala> for ( x <- fruits){
     |  dseq = dseq :+ ("aaa","bbb",x)
     | }

scala> dseq
res2: Seq[(String, String, String)] = List((aaa,bbb,apple), (aaa,bbb,orange), (aaa,bbb,melon))

scala> val df = dseq.toDF("aCol","bCol","name")
df: org.apache.spark.sql.DataFrame = [aCol: string, bCol: string, name: string]

scala> df.show
+----+----+------+
|aCol|bCol|  name|
+----+----+------+
| aaa| bbb| apple|
| aaa| bbb|orange|
| aaa| bbb| melon|
+----+----+------+
Rajat Mishra
  • And why did you feel the need to introduce a `var` here? – sarveshseri Apr 19 '17 at 09:43
  • Actually, what I tried was to create a `Seq` and convert it to a DataFrame; since I'm iterating through the list of fruits and appending to the same variable, I made it a `var`. – Rajat Mishra Apr 19 '17 at 09:47
  • The OP has used `var` but he did not actually need it. And, you could have just `mapped` the `fruits` into your `dseq`. The important thing to note here is that your `dseq` is a `List`. And then you are appending to this list in your `for` "loop". The problem with this is that `append` on `List` is `O(n)`, making your whole `dseq` generation `O(n^2)`, which will just kill performance on large data. – sarveshseri Apr 19 '17 at 09:51
  • Just make it a general principle to avoid `append` with Scala `List`. – sarveshseri Apr 19 '17 at 09:56
  • Thanks @SarveshKumarSingh. – Rajat Mishra Apr 19 '17 at 10:03
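
As a minimal sketch of the map-based alternative suggested in the comments above (again assuming a spark-shell session with spark.implicits._ in scope), the whole sequence can be built in one pass, without a var and without repeated list appends:

val fruits = List("apple", "orange", "melon")

// map builds the full sequence of tuples in a single pass (no O(n) appends)
val dseq = fruits.map(x => ("aaa", "bbb", x))
val df = dseq.toDF("aCol", "bCol", "name")
df.show()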