I have two tables with different but overlapping column sets. I want to concatenate them the way pandas does, but doing so in Spark is very inefficient.
X:
   A  B
0  1  3
1  2  4
Y:
   A  C
0  5  7
1  6  8
pd.concat([X, Y]):
   A    B    C
0  1    3  NaN
1  2    4  NaN
0  5  NaN    7
1  6  NaN    8
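In runnable form (note that pd.concat takes a list of frames), the pandas side is:

import pandas as pd

# two frames with overlapping but different column sets
X = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
Y = pd.DataFrame({"A": [5, 6], "C": [7, 8]})

# columns are aligned by name; missing entries become NaN
print(pd.concat([X, Y]))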
I tried to use Spark SQL to do it...
select A, B, null as C from X union all select A, null as B, C from Y
... and it is extremely slow. I applied this query to two tables of sizes (79 rows, 17330 columns) and (92 rows, 16 columns). It took 129 s on Spark 1.6.2, 319 s on Spark 2.0.1, and 1.2 s in pandas. Why is it so slow? Is this some kind of bug? Can it be done faster in Spark?
EDIT: I tried doing it programmatically, as in how to union 2 spark dataframes with different amounts of columns; it's even slower.
It seems that the problem is adding the null columns. Maybe that can be solved differently, or that part could be made faster?
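For reference, here is a sketch of the programmatic padding I have in mind (union_mismatched, df1, and df2 are just names for the sketch; it assumes any shared column has the same type in both frames):

from pyspark.sql import functions as F

def union_mismatched(df1, df2):
    # take each column's type from whichever frame defines it
    types = {f.name: f.dataType for f in df1.schema.fields + df2.schema.fields}
    cols = sorted(types)  # one common column order for both sides

    def pad(df):
        # a single select adds all missing columns at once; calling
        # withColumn once per missing column builds deeply nested
        # projections and is much slower to analyze
        return df.select([
            F.col(c) if c in df.columns
            else F.lit(None).cast(types[c]).alias(c)
            for c in cols
        ])

    return pad(df1).unionAll(pad(df2))

The nulls are cast to the real column types so the two sides line up for unionAll, and the whole padding happens in one select per frame.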