0

Given a sql query involving duplicated columns

    df = spark.sql(f""" 
    select * from 
    db.a a
    left join db.b b on a.id = b.id
    left join db.c c on b.id=c.id 
    left join db.d d on a.id= d.id 
    left join db.e e on a.id = e.id
    left join db.f f on a.id = f.id
    left join db.g g on a.id = e.id
    left join db.h h on a.id = h.id
    """
    )

The generated df contains a lot of duplicated columns. Some duplications are of 5 instances (id) are some are of two/three occurrences. I tried the methods suggested in SO, but all deal with joins of two tables.
The aim is to provide a holistic method that without any assumptions on the input dataframe or select query made, the output dataframe will not contain duplicated columns
As the query is done by another pipeline, I cannot change the join itself, or be exposed to the tables being joined. I only have the joint result, and need to filter accordingly.

Any ideas?

InsDSt
  • 37
  • 1
  • 4
  • based on the given query, i think you can use `drop(col('b.id'))` and so on where `b` is the table alias – samkart Aug 16 '22 at 09:08
  • But id is not the only column, and it is not only b of course. I'm looking for an holistic method that given a dataframe, without any assumptions on the join, can provide the desired output. – InsDSt Aug 16 '22 at 09:11
  • let's say you run `df.select('id')`, what error does it show? *must be something about ambiguous followed by `alias.column` for all id fields* – samkart Aug 16 '22 at 09:23
  • AFAIK, there isn't any solution (at least simple) that will help you in this case (you seem to not know about the query and the tables/alias being used). the best practice is to select only the required fields or drop the duplicated fields using the table/df alias. but see [this SO answer](https://stackoverflow.com/a/58405654/8279585) – samkart Aug 16 '22 at 09:29
  • @samkart you are right, the error gives me the alias name. but can I access it before the error? or should i let it error and iterate on all alias until I successed? – InsDSt Aug 16 '22 at 09:35

0 Answers0