
In Scala, it's easy to avoid duplicate columns after a join operation:

df1.join(df2, Seq("id"), "left").show()

However, is there a similar solution in PySpark? If I do df1.join(df2, df1["id"] == df2["id"], "left").show() in PySpark, I get two id columns...
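For a concrete repro, here is a minimal sketch (the toy df1/df2 data and the active SparkSession are my assumptions, not from the post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy frames mirroring the question's setup.
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "v1"])
df2 = spark.createDataFrame([(1, "x")], ["id", "v2"])

# Joining on a Column expression keeps both key columns:
df1.join(df2, df1["id"] == df2["id"], "left").printSchema()
# -> the schema now contains two 'id' fields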

Fluxy
  • Possible duplicate of [Spark Dataframe distinguish columns with duplicated name](https://stackoverflow.com/questions/33778664/spark-dataframe-distinguish-columns-with-duplicated-name) – pault Jul 22 '19 at 14:28

1 Answer


You have 3 options (a combined runnable sketch follows the list):

1. Join on the column name as a string (Spark then keeps a single id column), for example with an outer join:
aDF.join(bDF, "id", "outer").show()

2. Use aliasing: with this approach you will lose the B-side id values for rows that exist only in bDF, since b.id is dropped.
aDF.alias("a").join(bDF.alias("b"), aDF.id == bDF.id, "outer").drop(col("b.id")).show()  # needs: from pyspark.sql.functions import col

3. Use drop to remove the key columns after the join (here they were renamed to the hypothetical ida/idb beforehand so both can be dropped by name):
columns_to_drop = ['ida', 'idb']
df = df.drop(*columns_to_drop)
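A combined, runnable sketch of all three options (the aDF/bDF names and their data are illustrative assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Toy frames; names and data are illustrative only.
aDF = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "a_val"])
bDF = spark.createDataFrame([(1, "x"), (3, "y")], ["id", "b_val"])

# 1. String join key: Spark keeps a single 'id' column.
aDF.join(bDF, "id", "outer").show()

# 2. Alias both sides, then drop the right-hand 'id'.
aDF.alias("a").join(bDF.alias("b"), col("a.id") == col("b.id"), "outer").drop(col("b.id")).show()

# 3. Rename the keys so they are distinct, join, then drop both by name.
joined = aDF.withColumnRenamed("id", "ida").join(
    bDF.withColumnRenamed("id", "idb"), col("ida") == col("idb"), "outer")
columns_to_drop = ['ida', 'idb']
joined.drop(*columns_to_drop).show()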

Let me know if that helps.

Preetham
  • Thanks. Also I've just realized that `df1.join(df2, "id")` joins well and gives 1 `id` column. – Fluxy Jul 22 '19 at 09:32