
I am trying to merge thousands of DataFrames, held as a Seq[org.apache.spark.sql.DataFrame], into a single DataFrame. So I used something like the following, where x is the list of DataFrames:

val y = x.reduce(_ union _)

But it takes an eternity to complete.

Is there a more efficient way to complete this task, either in code or via Spark configuration settings?

Any help is really appreciated.

Appy22

1 Answer

First, I would try a "batchwise" union; sometimes this helps, since it produces a shallower, more balanced plan tree than one long chain of unions:

// union in batches of 50, then union the batch results
dfs.grouped(50)
  .map(dfss => dfss.reduce(_ union _))
  .reduce(_ union _)

If that's not enough, you can try checkpoints:

dfs.grouped(50)
  .map(dfss => dfss.reduce(_ union _).checkpoint(true)) // eager checkpoint truncates each batch's lineage
  .reduce(_ union _)
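
Note that checkpoint(true) performs an eager checkpoint, which materializes each batch and truncates its lineage; it requires a checkpoint directory to be set beforehand. A minimal sketch (the path is just an example):

// must be called once before .checkpoint(); the path is an example,
// in production use a reliable location such as HDFS
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")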

If the DataFrames are reasonably small, you could also reduce the number of partitions of the result (a union has the sum of all partitions of the input DataFrames) by using dfss.reduce(_ union _).coalesce(1) in the inner map.
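
Putting the pieces together, here is a minimal, self-contained sketch (shell-style); the local SparkSession, the toy input DataFrames, the checkpoint directory, and the batch size of 50 are all assumptions for illustration:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .master("local[*]") // assumption: local run for illustration
  .appName("union-many-dataframes")
  .getOrCreate()
import spark.implicits._

// checkpoint() fails without a checkpoint directory; the path is an example
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

// stand-in for the real Seq[DataFrame] (toy single-row DataFrames)
val dfs: Seq[DataFrame] = (1 to 1000).map(i => Seq(i).toDF("id"))

// batch-wise union: coalesce(1) keeps the partition count small,
// eager checkpoints truncate each batch's lineage
val merged = dfs.grouped(50)
  .map(batch => batch.reduce(_ union _).coalesce(1).checkpoint(true))
  .reduce(_ union _)

println(merged.count()) // 1000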

Raphael Roth