
Let's say I have the following two tables that I want to union:

+---------+-------+---+---+---+
|month_key|account|ch1|ch2|ch5|
+---------+-------+---+---+---+
|      Aug|    abc|  0|  1|  1|
+---------+-------+---+---+---+
+---------+-------+---+---+---+
|month_key|account|ch3|ch4|ch5|
+---------+-------+---+---+---+
|     Sept|    bcd|  1|  1|  0|
+---------+-------+---+---+---+

How can I keep all the columns when doing the union? I'm getting the following results when I do:

df1.union(df2).show()

+---------+-------+---+---+---+
|month_key|account|ch1|ch2|ch5|
+---------+-------+---+---+---+
|      Aug|    abc|  0|  1|  1|
|     Sept|    bcd|  1|  1|  0|
+---------+-------+---+---+---+

In my actual dataset I'll have hundreds of channels, and I think a join would be difficult since I'd need to list all of the common columns.

I'm using Spark version 2.4.3.

I know that unionByName with allowMissingColumns can be used on Spark 3.x, but is there an alternative?

Thanks for your help.

Sarah Rahman
    Does this answer your question? [How to perform union on two DataFrames with different amounts of columns in spark?](https://stackoverflow.com/questions/39758045/how-to-perform-union-on-two-dataframes-with-different-amounts-of-columns-in-spar) – blackbishop Nov 15 '21 at 09:09
  • [This article](https://sparkbyexamples.com/spark/spark-merge-two-dataframes-with-different-columns/) shall give you the exact insight you are seeking for – MJK618 Nov 15 '21 at 09:14

1 Answer


From https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.union.html:

"As standard in SQL, this function resolves columns by position (not by name)"

To do what you want:

df1.unionByName(df2, allowMissingColumns=True)

(available from Spark 3.1.X)
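
For completeness, here is a minimal sketch of both routes, assuming the toy frames below mirror the question's example. The commented-out line is this answer's approach and only works on Spark 3.1+; union_all_columns is a hypothetical helper name (not a Spark API) illustrating a Spark 2.4.x workaround: pad each side with the columns it is missing as typed null literals, then call the plain unionByName that already exists in 2.3+.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy frames mirroring the question's tables (illustrative only).
df1 = spark.createDataFrame([("Aug", "abc", 0, 1, 1)],
                            ["month_key", "account", "ch1", "ch2", "ch5"])
df2 = spark.createDataFrame([("Sept", "bcd", 1, 1, 0)],
                            ["month_key", "account", "ch3", "ch4", "ch5"])

# Spark 3.1+: missing columns on either side are filled with nulls.
# df1.unionByName(df2, allowMissingColumns=True).show()

# Spark 2.4.x workaround (hypothetical helper, not a Spark API): add each
# side's missing columns as nulls cast to the other side's type, then
# union by name.
def union_all_columns(a, b):
    a_cols, b_cols = a.columns, b.columns
    for c in b_cols:
        if c not in a_cols:
            a = a.withColumn(c, F.lit(None).cast(b.schema[c].dataType))
    for c in a_cols:
        if c not in b_cols:
            b = b.withColumn(c, F.lit(None).cast(a.schema[c].dataType))
    return a.unionByName(b)

union_all_columns(df1, df2).show()

Because the final union resolves columns by name after the padding step, there is no need to hand-list the hundreds of common channel columns.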

Luiz Viola