I have a PySpark DataFrame `df1`:
|id |name |email |age|college|
|---|-----|-----------------|---|-------|
|12 |Sta |sta@example.com |25 |clg1 |
|21 |Danny|dany@example.com |23 |clg2 |
|37 |Elle |elle@example.com |27 |clg3 |
|40 |Mark |mark1@example.com|40 |clg4 |
|36 |John |jhn@example.com |32 |clg5 |
and a second PySpark DataFrame `df2`:
|id |name |age|
|---|------|---|
|36 |Sta |30 |
|12 |raj |25 |
|29 |jack |33 |
|87 |Mark |67 |
|75 |Alle |23 |
|89 |Jalley|32 |
|55 |kale |99 |
Now I want to join `df2` with `df1` to attach `email` and `college` to `df2`, under these conditions: match on `df1.id = df2.id` first; if that fails, match on `df1.name = df2.name`; if that also fails, match on `df1.age = df2.age`; if none match, fill NULL.
In other words, once the first condition matches, the remaining conditions should not be evaluated. If the first condition does not match, the other conditions should be tried in order, and if none of them match, the result should be NULL.
For example, `df2` should become:
|id |name |age|email |college|
|---|------|---|-----------------|-------|
|36 |Sta |30 |jhn@example.com |clg5 |
|12 |raj |25 |sta@example.com |clg1 |
|29 |jack |33 |NULL |NULL |
|87 |Mark |67 |mark1@example.com|clg4 |
|75 |Alle |23 |dany@example.com |clg2 |
|89 |Jalley|32 |jhn@example.com |clg5 |
|55 |kale |99 |NULL |NULL |
I have tried a lot with the built-in join functions but was not able to achieve this. I also tried writing a UDF, but it is very inefficient.
The data is too large to apply a UDF to it, and I am running on a Spark 3.x cluster.