
For a set of dataframes

val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x")
val df2 = sc.parallelize(1 to 4).map(i => (i,i*100)).toDF("id","y")
val df3 = sc.parallelize(1 to 4).map(i => (i,i*1000)).toDF("id","z")

to union all of them I do

df1.unionAll(df2).unionAll(df3)

Is there a more elegant and scalable way of doing this for any number of dataframes, for example from

Seq(df1, df2, df3) 

5 Answers


For pyspark you can do the following:

from functools import reduce
from pyspark.sql import DataFrame

dfs = [df1,df2,df3]
df = reduce(DataFrame.unionAll, dfs)

It's also worth noting that the columns must appear in the same order in every dataframe in the list for this to work. With mismatched column orders, the union can silently give unexpected results!

If you are using pyspark 2.3 or greater, you can use unionByName so you don't have to reorder the columns.
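For reference, a minimal Scala sketch of the same idea (dfA and dfB are hypothetical frames, not from the question; assumes Spark 2.3+ and a spark-shell style session like the one in the question):

val dfA = sc.parallelize(1 to 2).map(i => (i, i*10)).toDF("id", "x")
val dfB = sc.parallelize(3 to 4).map(i => (i*10, i)).toDF("x", "id")  // same columns, different order

// unionByName matches columns by name, so the differing order is harmless
val combined = Seq(dfA, dfB).reduce(_ unionByName _)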


The simplest solution is to reduce with union (unionAll in Spark < 2.0):

val dfs = Seq(df1, df2, df3)
dfs.reduce(_ union _)

This is relatively concise and shouldn't move data from off-heap storage, but it extends the lineage with each union and requires non-linear time to perform plan analysis, which can be a problem if you try to merge a large number of DataFrames.

You can also convert to RDDs and use SparkContext.union:

dfs match {
  case h :: Nil => Some(h)
  case h :: _   => Some(h.sqlContext.createDataFrame(
                     h.sqlContext.sparkContext.union(dfs.map(_.rdd)),
                     h.schema
                   ))
  case Nil  => None
}

It keeps the lineage short and the analysis cost low, but otherwise it is less efficient than merging DataFrames directly.
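
A self-contained sketch of the same approach wrapped in a helper (unionViaRDD is just an illustrative name, not a Spark API):

import org.apache.spark.sql.DataFrame

// Illustrative only: unions the underlying RDDs in a single step and
// rebuilds a DataFrame using the schema of the first input.
def unionViaRDD(dfs: List[DataFrame]): Option[DataFrame] = dfs match {
  case h :: Nil => Some(h)
  case h :: _   => Some(h.sqlContext.createDataFrame(
                     h.sqlContext.sparkContext.union(dfs.map(_.rdd)),
                     h.schema))
  case Nil      => None
}

// usage with the question's DataFrames (their schemas are position-compatible)
val merged: Option[DataFrame] = unionViaRDD(List(df1, df2, df3))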


You can pass parameters such as allowMissingColumns by using reduce with a lambda:

from functools import reduce
from pyspark.sql import DataFrame

dfs = [df1, df2]
df = reduce(lambda x, y: x.unionByName(y, allowMissingColumns=True), dfs)
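
A rough Scala counterpart, as a sketch (assuming Spark 3.1+, where unionByName also accepts allowMissingColumns):

val dfs = Seq(df1, df2)
// columns missing on either side are filled with nulls
val df = dfs.reduce((left, right) => left.unionByName(right, allowMissingColumns = true))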

Under the hood, Spark flattens union expressions, so the union takes longer when it is built up linearly.

The best solution would be for Spark to provide a union function that accepts multiple DataFrames.

But the following code might speed up the union of multiple DataFrames (or Datasets) somewhat:

import scala.reflect.ClassTag
import org.apache.spark.sql.Dataset

// Union a collection of Datasets pairwise (tree-shaped) instead of linearly,
// which keeps the logical plan depth at O(log n).
def union[T : ClassTag](datasets: TraversableOnce[Dataset[T]]): Dataset[T] = {
  binaryReduce[Dataset[T]](datasets, _.union(_))
}

// Generic pairwise reduce: repeatedly combines neighbouring elements
// until a single value remains.
def binaryReduce[T : ClassTag](ts: TraversableOnce[T], op: (T, T) => T): T = {
  if (ts.isEmpty) {
    throw new IllegalArgumentException("cannot reduce an empty collection")
  }
  val array = ts.toArray
  var size = array.length
  while (size > 1) {
    val newSize = (size + 1) / 2
    for (i <- 0 until newSize) {
      val index = i * 2
      val index2 = index + 1
      if (index2 >= size) {
        array(i) = array(index)  // odd element out, carried over as-is
      } else {
        array(i) = op(array(index), array(index2))
      }
    }
    size = newSize
  }
  array(0)
}
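
A brief usage sketch, assuming the helper above is in scope together with the question's df1, df2 and df3:

// pairs (df1, df2) first, then unions the result with df3
val merged = union(Seq(df1, df2, df3))
merged.show()

For three inputs the benefit is negligible; the balanced shape matters when combining a large number of DataFrames.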

In case some dataframes have missing columns, one can use a partially applied function:

from functools import partial, reduce
from pyspark.sql import DataFrame

# Union dataframes by name (missing columns filled with null) 
union_by_name = partial(DataFrame.unionByName, allowMissingColumns=True)
df_output = reduce(union_by_name, [df1, df2, ...])