Long-time SAS user, new to Databricks, and I'm trying to migrate some basic code. I'm running into an extremely basic join issue but cannot find a solution.
In SAS (proc sql), when I run the following code, SAS is smart enough to realize that the joining columns exist on both the left and right tables, and so it only produces one instance of those variables. For example:
proc sql;
  create table work.test as
  select *
  from data.table1 t1
  left join data.table2 t2
    on (t1.bene_id = t2.bene_id) and (t1.pde_id = t2.pde_id);
quit;
This code runs just fine.
However, when I run the same thing in Databricks, it produces both instances of the bene_id and pde_id fields, and therefore bombs out when it tries to create the table (because it's trying to create columns with the same name).
I realize one solution is to not use * in the select statement and instead manually specify each field, making sure I'm only selecting a single instance of each one, but with the number of joins happening plus the number of fields I'm dealing with, this is a real waste of time.
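For reference, this is roughly what that manual approach looks like on my end (the non-key column names below, srvc_dt and rx_cost, are just placeholders for illustration):

%python
# Sketch of the explicit-column approach: every field is listed exactly once.
# srvc_dt and rx_cost are made-up column names standing in for the real ones.
temp = spark.sql("""
    select t1.bene_id,
           t1.pde_id,
           t1.srvc_dt,
           t2.rx_cost
    from data.table1 t1
    left join data.table2 t2
      on t1.bene_id = t2.bene_id
     and t1.pde_id = t2.pde_id
""")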
I also came across another potential solution, which is this sort of syntax:
%python
t1 = spark.read.table("data1")
t2 = spark.read.table("data2")

# Passing the join keys as a list of names keeps a single copy of bene_id and pde_id
temp = t1.join(t2, ["bene_id", "pde_id"], "left")
However, this only suppresses duplicates for the fields being joined on (i.e., bene_id and pde_id). If there were a third field, say srvc_dt, present in both tables but not used in the join, it will again show up twice and the table creation will bomb out.
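To make the failure concrete (srvc_dt here is just a stand-in for any shared non-key field):

%python
temp = t1.join(t2, ["bene_id", "pde_id"], "left")

# The join keys are de-duplicated, but any other column shared by both tables
# appears twice, e.g. temp.columns would list srvc_dt two times.
print(temp.columns)

# Persisting the result then fails with a duplicate-column error.
temp.write.saveAsTable("work.test")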
Finally, I realize another solution is to write some code that dynamically renames the columns in the left and right tables so that all columns always have unique names. I just feel like there has to be a simple way to achieve what SAS is doing without all these workarounds, and I'm just not aware of it.
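For completeness, this is the kind of renaming workaround I mean, just sketched out (the _t2 suffix is arbitrary):

%python
join_keys = ["bene_id", "pde_id"]

# Rename every non-key column in t2 that also exists in t1 so all names stay unique
overlap = [c for c in t2.columns if c in t1.columns and c not in join_keys]
t2_renamed = t2
for c in overlap:
    t2_renamed = t2_renamed.withColumnRenamed(c, c + "_t2")

temp = t1.join(t2_renamed, join_keys, "left")
temp.write.saveAsTable("work.test")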
Thanks for any advice.