
I am trying to merge thousands of DataFrames, held as a Seq[org.apache.spark.sql.DataFrame], into a single DataFrame. So I used something like the following, where x is the list of DataFrames:

val y = x.reduce(_ union _)

But it takes an eternity to complete.

Is there a more efficient way to complete this task, either in code or via Spark configuration settings?

Any help is really appreciated.

Appy22

1 Answer

First, I would try a "batchwise" union; sometimes this helps, since it produces a shallower, more balanced plan tree than one long chain of unions:

// union in batches of 50, then union the batch results
dfs.grouped(50)
  .map(dfss => dfss.reduce(_ union _))
  .reduce(_ union _)

If that's not enough, you can try checkpoints:

dfs.grouped(50)
  .map(dfss => dfss.reduce(_ union _).checkpoint(true)) // eager checkpoint truncates each batch's lineage
  .reduce(_ union _)
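
Note that checkpoint(true) performs an eager checkpoint, which materializes each batch and truncates its lineage; it requires a checkpoint directory to be set beforehand. A minimal sketch (the path is just an example):

// must be called once before .checkpoint(); the path is an example,
// in production use a reliable location such as HDFS
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")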

If the DataFrames are reasonably small, you could also reduce the number of partitions of the result (a union has the sum of all partitions of the input DataFrames) by using dfss.reduce(_ union _).coalesce(1) in the inner map.
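
Putting the pieces together, here is a minimal, self-contained sketch (shell-style); the local SparkSession, the toy input DataFrames, the checkpoint directory, and the batch size of 50 are all assumptions for illustration:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .master("local[*]") // assumption: local run for illustration
  .appName("union-many-dataframes")
  .getOrCreate()
import spark.implicits._

// checkpoint() fails without a checkpoint directory; the path is an example
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

// stand-in for the real Seq[DataFrame] (toy single-row DataFrames)
val dfs: Seq[DataFrame] = (1 to 1000).map(i => Seq(i).toDF("id"))

// batch-wise union: coalesce(1) keeps the partition count small,
// eager checkpoints truncate each batch's lineage
val merged = dfs.grouped(50)
  .map(batch => batch.reduce(_ union _).coalesce(1).checkpoint(true))
  .reduce(_ union _)

println(merged.count()) // 1000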

Raphael Roth