
I have two tables with different but overlapping column sets. I want to concatenate them the way pandas does, but it is very inefficient in Spark.

X:
   A  B
0  1  3
1  2  4

Y:
   A  C
0  5  7
1  6  8

pd.concat([X, Y]):
   A   B   C
0  1   3 NaN
1  2   4 NaN
0  5 NaN   7
1  6 NaN   8
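
A minimal pandas version of the example above, for reference:

import pandas as pd

X = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
Y = pd.DataFrame({"A": [5, 6], "C": [7, 8]})

# concat aligns columns by name; values missing from either frame become NaN
print(pd.concat([X, Y]))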

I tried to use Spark SQL to do it...

select A, B, null as C from X union all select A, null as B, C from Y
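
For completeness, this is roughly how the query is invoked (Spark 1.6 API; the real query lists all the columns of both tables in the same way):

X.registerTempTable("X")
Y.registerTempTable("Y")
# Missing columns are filled with null literals on each side of the union.
result = sqlContext.sql(
    "select A, B, null as C from X union all select A, null as B, C from Y")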

... and it is extremely slow. I applied this query to two tables with sizes (79 rows, 17330 columns) and (92 rows, 16 columns). It took 129s on Spark 1.6.2, 319s on Spark 2.0.1, and 1.2s in pandas. Why is it so slow? Is this some kind of bug? Can it be done faster using Spark?

EDIT: I tried to do it programmatically, as described here: how to union 2 spark dataframes with different amounts of columns (roughly the sketch below) - it's even slower.
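
The programmatic version boils down to adding the missing columns as null literals before the union, along these lines (a sketch of that approach, not the exact code from the link):

from pyspark.sql.functions import lit

def with_all_columns(df, all_columns):
    # Add a null literal for every column the frame lacks
    # (in practice the null may need a cast to the target column's type),
    # then select in a fixed order so the union lines up positionally.
    for c in all_columns:
        if c not in df.columns:
            df = df.withColumn(c, lit(None))
    return df.select(all_columns)

all_columns = sorted(set(X.columns) | set(Y.columns))
result = with_all_columns(X, all_columns).unionAll(with_all_columns(Y, all_columns))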

It seems that the problem is adding the null columns. Could this be solved differently, or could that part be made faster?

  • a) How do you read data? b) How did you configure the cluster? –  Oct 19 '16 at 22:20
  • It's just a testing setup, so I run a standalone Spark instance on my laptop through ./bin/pyspark. I preprocess the data in Python and then call `X = sc.parallelize(list_of_rows).toDF(column_names)` to turn it into a DataFrame. – tosterovic Oct 19 '16 at 22:33
  • Possible duplicate of [how to union 2 spark dataframes with different amounts of columns](http://stackoverflow.com/questions/39758045/how-to-union-2-spark-dataframes-with-different-amounts-of-columns) – Alberto Bonsanto Oct 19 '16 at 23:06
  • The emphasis is on **efficiently**. The solution in your link is as slow as mine. – tosterovic Oct 20 '16 at 08:14

0 Answers