unionAll in Spark SQL hangs

Question

I have two data sets produced by OLTP/OLAP processing. Both contain the same fields, but their schemas differ in nullability: a field may be nullable in one data set and not nullable in the other.

To explain in detail, say I have df1 and df2:

df1 has field 'a' of type long, not nullable
df2 has field 'a' of type long, nullable

Before calling unionAll, I rebuild df1 with df2's schema:

val x = df1.sqlContext.createDataFrame(df1.rdd, df2.schema)
x.unionAll(df2)
Result: the job hangs.

I also tried:

df1.sqlContext.createDataFrame(df1.rdd, df2.schema)
df1.unionAll(df2)
Result: the job hangs here as well.

Please let me know how to avoid this issue, or whether I am doing something wrong.

~Prashant

score 0 · Answer 1 · edited May 23 '17 at 10:33

I experienced the same thing. Check the number of partitions before and after the unionAll. The count after the union is probably the sum of the partition counts of df1 and df2, because the operation simply concatenates rows. You can probably repartition your data like this:

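import org.apache.spark.HashPartitioner

// Hash-partition the rows into 5 partitions, keyed on the Int column at index 1.
val partitioner = new HashPartitioner(5)

// Key each row, redistribute by key, then drop the key and rebuild the DataFrame.
val repartitioned = sqlContext.createDataFrame(
  df.rdd.map(r => (r.getInt(1), r)).partitionBy(partitioner).values,
  df.schema
)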

See "How to define partitioning of DataFrame?" for more information on partitioning.
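
To check the partition counts mentioned above, something like the following rough sketch may help. It assumes a Spark 1.x style SQLContext running in local mode; df1 and df2 here are small made-up stand-ins for the question's DataFrames, and the object name, the helper name, the partition counts 4 and 6, and the target of 5 are all hypothetical. It uses coalesce rather than the HashPartitioner approach, simply as a shorter way to reduce the partition count after the union.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object UnionPartitionCheck {

  // Returns the partition counts of (df1, df2, df1 unionAll df2, the coalesced union).
  def partitionCounts(sqlContext: SQLContext): (Int, Int, Int, Int) = {
    import sqlContext.implicits._
    val sc = sqlContext.sparkContext

    // Hypothetical stand-ins for df1 and df2: one long column 'a',
    // created with 4 and 6 partitions respectively.
    val df1 = sc.parallelize((1L to 100L).map(Tuple1(_)), 4).toDF("a")
    val df2 = sc.parallelize((101L to 200L).map(Tuple1(_)), 6).toDF("a")

    // unionAll concatenates the rows of both inputs.
    val unioned = df1.unionAll(df2)

    // coalesce reduces the number of partitions without a full shuffle.
    val compacted = unioned.coalesce(5)

    (df1.rdd.partitions.length,
     df2.rdd.partitions.length,
     unioned.rdd.partitions.length,
     compacted.rdd.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("union-partition-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (p1, p2, pUnion, pCompacted) = partitionCounts(new SQLContext(sc))
      println(s"df1=$p1 df2=$p2 union=$pUnion coalesced=$pCompacted")
    } finally {
      sc.stop()
    }
  }
}
```

If the count after the union turns out to be much larger than expected on the real data, that may point to partitioning as the cause of the slowdown. It does not rule out other causes of the hang.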
