
In Scala, it's easy to avoid duplicate columns after a join operation:

df1.join(df2, Seq("id"), "left").show()

However, is there a similar solution in PySpark? If I do df1.join(df2, df1["id"] == df2["id"], "left").show() in PySpark, I get two id columns...
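For a concrete repro, here is a minimal sketch (the toy df1/df2 data and the active SparkSession are my assumptions, not from the post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy frames mirroring the question's setup.
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "v1"])
df2 = spark.createDataFrame([(1, "x")], ["id", "v2"])

# Joining on a Column expression keeps both key columns:
df1.join(df2, df1["id"] == df2["id"], "left").printSchema()
# -> the schema now contains two 'id' fields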

Fluxy
  • Possible duplicate of [Spark Dataframe distinguish columns with duplicated name](https://stackoverflow.com/questions/33778664/spark-dataframe-distinguish-columns-with-duplicated-name) – pault Jul 22 '19 at 14:28

1 Answer


You have 3 options (a combined runnable sketch follows the list):

1. Join on the column name as a string (Spark then keeps a single id column), for example with an outer join:
aDF.join(bDF, "id", "outer").show()

2. Use aliasing: with this approach you will lose the B-side id values for rows that exist only in bDF, since b.id is dropped.
aDF.alias("a").join(bDF.alias("b"), aDF.id == bDF.id, "outer").drop(col("b.id")).show()  # needs: from pyspark.sql.functions import col

3. Use drop to remove the key columns after the join (here they were renamed to the hypothetical ida/idb beforehand so both can be dropped by name):
columns_to_drop = ['ida', 'idb']
df = df.drop(*columns_to_drop)
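A combined, runnable sketch of all three options (the aDF/bDF names and their data are illustrative assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Toy frames; names and data are illustrative only.
aDF = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "a_val"])
bDF = spark.createDataFrame([(1, "x"), (3, "y")], ["id", "b_val"])

# 1. String join key: Spark keeps a single 'id' column.
aDF.join(bDF, "id", "outer").show()

# 2. Alias both sides, then drop the right-hand 'id'.
aDF.alias("a").join(bDF.alias("b"), col("a.id") == col("b.id"), "outer").drop(col("b.id")).show()

# 3. Rename the keys so they are distinct, join, then drop both by name.
joined = aDF.withColumnRenamed("id", "ida").join(
    bDF.withColumnRenamed("id", "idb"), col("ida") == col("idb"), "outer")
columns_to_drop = ['ida', 'idb']
joined.drop(*columns_to_drop).show()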

Let me know if that helps.

Preetham
  • Thanks. Also I've just realized that `df1.join(df2, "id")` joins well and gives 1 `id` column. – Fluxy Jul 22 '19 at 09:32