
Let's say I have the following two tables that I want to union:

+---------+-------+---+---+---+
|month_key|account|ch1|ch2|ch5|
+---------+-------+---+---+---+
|      Aug|    abc|  0|  1|  1|
+---------+-------+---+---+---+
+---------+-------+---+---+---+
|month_key|account|ch3|ch4|ch5|
+---------+-------+---+---+---+
|     Sept|    bcd|  1|  1|  0|
+---------+-------+---+---+---+

How can I keep all the columns when doing the union? I'm getting the following results when I do:

df1.union(df2).show()

+---------+-------+---+---+---+
|month_key|account|ch1|ch2|ch5|
+---------+-------+---+---+---+
|      Aug|    abc|  0|  1|  1|
|     Sept|    bcd|  1|  1|  0|
+---------+-------+---+---+---+

In my actual dataset I'll have hundreds of channels, and I think a join would be difficult since I'd need to list all of the common columns.

I'm using Spark version 2.4.3.

I know that unionByName with allowMissingColumns can be used on Spark 3.x, but is there an alternative?

Thanks for your help.

Sarah Rahman
    Does this answer your question? [How to perform union on two DataFrames with different amounts of columns in spark?](https://stackoverflow.com/questions/39758045/how-to-perform-union-on-two-dataframes-with-different-amounts-of-columns-in-spar) – blackbishop Nov 15 '21 at 09:09
  • [This article](https://sparkbyexamples.com/spark/spark-merge-two-dataframes-with-different-columns/) shall give you the exact insight you are seeking for – MJK618 Nov 15 '21 at 09:14

1 Answer


From https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.union.html:

"As standard in SQL, this function resolves columns by position (not by name)"

To do what you want:

df1.unionByName(df2, allowMissingColumns=True)

(available from Spark 3.1.X)
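
For completeness, here is a minimal sketch of both routes, assuming the toy frames below mirror the question's example. The commented-out line is this answer's approach and only works on Spark 3.1+; union_all_columns is a hypothetical helper name (not a Spark API) illustrating a Spark 2.4.x workaround: pad each side with the columns it is missing as typed null literals, then call the plain unionByName that already exists in 2.3+.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy frames mirroring the question's tables (illustrative only).
df1 = spark.createDataFrame([("Aug", "abc", 0, 1, 1)],
                            ["month_key", "account", "ch1", "ch2", "ch5"])
df2 = spark.createDataFrame([("Sept", "bcd", 1, 1, 0)],
                            ["month_key", "account", "ch3", "ch4", "ch5"])

# Spark 3.1+: missing columns on either side are filled with nulls.
# df1.unionByName(df2, allowMissingColumns=True).show()

# Spark 2.4.x workaround (hypothetical helper, not a Spark API): add each
# side's missing columns as nulls cast to the other side's type, then
# union by name.
def union_all_columns(a, b):
    a_cols, b_cols = a.columns, b.columns
    for c in b_cols:
        if c not in a_cols:
            a = a.withColumn(c, F.lit(None).cast(b.schema[c].dataType))
    for c in a_cols:
        if c not in b_cols:
            b = b.withColumn(c, F.lit(None).cast(a.schema[c].dataType))
    return a.unionByName(b)

union_all_columns(df1, df2).show()

Because the final union resolves columns by name after the padding step, there is no need to hand-list the hundreds of common channel columns.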

Luiz Viola