Long-time SAS user, new to Databricks, and I'm trying to migrate some basic code. I'm running into an extremely basic join issue but cannot find a solution.
In SAS (proc sql), when I run the following code, SAS is smart enough to realize that the joining columns exist on both the left and right tables, and so it only produces one instance of those variables. For example:
proc sql;
  create table work.test as
  select *
  from data.table1 t1
  left join data.table2 t2
    on (t1.bene_id = t2.bene_id) and (t1.pde_id = t2.pde_id);
quit;
This code runs just fine.
However, when I run the same thing in Databricks, it produces both instances of the bene_id and pde_id fields, and therefore bombs out when it tries to create the table (because it's trying to create columns with the same name).
I realize one solution is to not use * in the select statement and instead manually specify each field, making sure I'm only selecting a single instance of each one, but with the number of joins happening plus the number of fields I'm dealing with, this is a real waste of time.
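For reference, this is roughly what that manual approach looks like on my end (the non-key column names below, srvc_dt and rx_cost, are just placeholders for illustration):

%python
# Sketch of the explicit-column approach: every field is listed exactly once.
# srvc_dt and rx_cost are made-up column names standing in for the real ones.
temp = spark.sql("""
    select t1.bene_id,
           t1.pde_id,
           t1.srvc_dt,
           t2.rx_cost
    from data.table1 t1
    left join data.table2 t2
      on t1.bene_id = t2.bene_id
     and t1.pde_id = t2.pde_id
""")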
I also came across another potential solution, which is this sort of syntax:
%python
t1 = spark.read.table("data1")
t2 = spark.read.table("data2")

# Passing the join keys as a list of names keeps a single copy of bene_id and pde_id
temp = t1.join(t2, ["bene_id", "pde_id"], "left")
However, this only suppresses duplicates for the fields being joined on (i.e., bene_id and pde_id). If there were a third field, say srvc_dt, present in both tables but not used in the join, it will again show up twice and the table creation will bomb out.
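To make the failure concrete (srvc_dt here is just a stand-in for any shared non-key field):

%python
temp = t1.join(t2, ["bene_id", "pde_id"], "left")

# The join keys are de-duplicated, but any other column shared by both tables
# appears twice, e.g. temp.columns would list srvc_dt two times.
print(temp.columns)

# Persisting the result then fails with a duplicate-column error.
temp.write.saveAsTable("work.test")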
Finally, I realize another solution is to write some code that dynamically renames the columns in the left and right tables so that all columns always have unique names. I just feel like there has to be a simple way to achieve what SAS is doing without all these workarounds, and I'm just not aware of it.
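For completeness, this is the kind of renaming workaround I mean, just sketched out (the _t2 suffix is arbitrary):

%python
join_keys = ["bene_id", "pde_id"]

# Rename every non-key column in t2 that also exists in t1 so all names stay unique
overlap = [c for c in t2.columns if c in t1.columns and c not in join_keys]
t2_renamed = t2
for c in overlap:
    t2_renamed = t2_renamed.withColumnRenamed(c, c + "_t2")

temp = t1.join(t2_renamed, join_keys, "left")
temp.write.saveAsTable("work.test")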
Thanks for any advice.