Given a SQL query involving duplicated columns:
df = spark.sql("""
select * from
db.a a
left join db.b b on a.id = b.id
left join db.c c on b.id = c.id
left join db.d d on a.id = d.id
left join db.e e on a.id = e.id
left join db.f f on a.id = f.id
left join db.g g on a.id = g.id
left join db.h h on a.id = h.id
""")
The generated df contains a lot of duplicated columns. Some names appear five times (e.g. id), others two or three times.
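For context, the duplicate counts can be inspected directly on the joined result (a minimal sketch; df is the DataFrame from the query above):

from collections import Counter

# Tally each column name in the joined result and keep only the duplicates
dupes = {name: n for name, n in Counter(df.columns).items() if n > 1}
print(dupes)  # e.g. {'id': 5, ...}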
I tried the methods suggested on SO, but they all deal with joins of two tables.
The aim is a generic method that, without any assumptions about the input dataframe or the select query that produced it, yields an output dataframe with no duplicated columns.
As the query is issued by another pipeline, I cannot change the joins themselves or access the individual tables being joined. I only have the joined result, and need to deduplicate it after the fact.
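One direction I've been considering, though I'm not sure it's robust, is to rename every column positionally (since duplicated names are ambiguous to select by name) and keep only the first occurrence of each original name. A rough sketch, assuming keeping the first occurrence is acceptable:

# Rename all columns positionally to unique temporary names via toDF,
# then select the first occurrence of each original name and restore it.
cols = df.columns
tmp = df.toDF(*[f"_c{i}" for i in range(len(cols))])

seen = set()
selected = []
for i, name in enumerate(cols):
    if name not in seen:
        seen.add(name)
        selected.append(tmp[f"_c{i}"].alias(name))

dedup = tmp.select(*selected)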
Any ideas?