Given a SQL query involving duplicated columns:
df = spark.sql("""
select * from
db.a a
left join db.b b on a.id = b.id
left join db.c c on b.id = c.id
left join db.d d on a.id = d.id
left join db.e e on a.id = e.id
left join db.f f on a.id = f.id
left join db.g g on a.id = g.id
left join db.h h on a.id = h.id
""")
The generated df contains a lot of duplicated columns. Some names appear five times (e.g. id), others two or three times.
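For context, the duplicate counts can be inspected directly on the joined result (a minimal sketch; df is the DataFrame from the query above):

from collections import Counter

# Tally each column name in the joined result and keep only the duplicates
dupes = {name: n for name, n in Counter(df.columns).items() if n > 1}
print(dupes)  # e.g. {'id': 5, ...}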
I tried the methods suggested on SO, but they all deal with joins of two tables.
The aim is a generic method that, without any assumptions about the input dataframe or the select query that produced it, yields an output dataframe with no duplicated columns.
As the query is issued by another pipeline, I cannot change the joins themselves or access the individual tables being joined. I only have the joined result, and need to deduplicate it after the fact.
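One direction I've been considering, though I'm not sure it's robust, is to rename every column positionally (since duplicated names are ambiguous to select by name) and keep only the first occurrence of each original name. A rough sketch, assuming keeping the first occurrence is acceptable:

# Rename all columns positionally to unique temporary names via toDF,
# then select the first occurrence of each original name and restore it.
cols = df.columns
tmp = df.toDF(*[f"_c{i}" for i in range(len(cols))])

seen = set()
selected = []
for i, name in enumerate(cols):
    if name not in seen:
        seen.add(name)
        selected.append(tmp[f"_c{i}"].alias(name))

dedup = tmp.select(*selected)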
Any ideas?