Join in pyspark without duplicate columns

Question

This is in reference to the Scala solution given in the thread "How to avoid duplicate columns after join?"

>>> a.show()
+---+----+
|key| val|
+---+----+
|  a|   1|
|  b|   2|
+---+----+

and

>>> b.show()
+---+----+
|key| val|
+---+----+
|  a|  11|
+---+----+

Expected output

+---+----+
|key| val|
+---+----+
|  a|   1|
+---+----+

So I need to fetch the rows of DataFrame "a" whose "key" also appears in DataFrame "b".

One of the solutions given in Scala works, and is shown below:

scala> a.join(b, a("key") === b("key"), "left").select(a.columns.map(a(_)) : _*).show

Since I don't know Scala, I am not able to implement this in Python. Kindly help me write this in Python. Any other solution would also be appreciated (without hardcoding the DataFrame's columns).

score 1 · Accepted Answer · answered Sep 07 '18 at 07:04

val a = sc.parallelize(Seq(("a","1"),("b","2"))).toDF("key","value")
a.show

val b = sc.parallelize(Seq(("a","11"))).toDF("key","value")
b.show

// A left semi join returns only a's columns, for the rows of a whose key exists in b
a.join(b, a("key") === b("key"), "leftsemi").show

Chandan Ray

How can this be done if I want the output to be the rows whose key does not match? In this example that would be | b | 2 |. – Bharat Sharma Nov 15 '18 at 05:23
