Trying to remove duplicate column names in a PySpark DataFrame after joining Hive tables?
Hi, I am trying to join multiple datasets, with 200+ columns in the final result. Because of the requirements and the high number of columns, I cannot select specific columns when joining. Is there a way to remove duplicate columns after the join? I know this can be done with the `.join` method on Spark DataFrames, but the base tables I am joining are not Spark DataFrames, and I am trying to avoid converting them to Spark DataFrames before the join.
Original PySpark join query to create the Spark DataFrame:
cust_base=sqlc.sql('''
Select distinct *
FROM db.tbl1 as t1
LEFT JOIN db.tbl2 as t2 ON (t1.acct_id=t2.acct_id)
LEFT JOIN db.tbl3 as t3 ON (t1.cust_id=t3.cust_id)
WHERE t1.acct_subfam_mn IN ('PIA','PIM','IAA')
AND t1.active_acct_ct <> 0
AND t1.efectv_dt = '2018-10-31'
AND (t2.last_change_dt<='2018-10-31' AND (t2.to_dt is null OR t2.to_dt >
'2018-10-31'))
AND (t3.last_change_dt<='2018-10-31' AND (t3.to_dt is null OR t3.to_dt >
'2018-10-31'))
''')
cust_base.registerTempTable("df1")
Error while checking the distinct count of cust_id:
a=sqlc.sql('''
Select
count(distinct a.cust_id) as CT_ID
From df1 as a
''')
AnalysisException: "Reference 'cust_id' is ambiguous, could be: cust_id#7L,
cust_id#171L.; line 3 pos 15"
This is because the 'cust_id' field is present more than once due to the join.
I want to remove the duplicate columns from the resulting joined DataFrame. Thanks in advance!
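For context, one direction I have been sketching (untested against these tables): since `sqlc.sql()` already returns a Spark DataFrame, the duplicate names could be renamed positionally with `toDF` and then only the first occurrence of each name selected. The name bookkeeping is plain Python; the helper functions below are my own, not Spark API:

```python
def uniquify(cols):
    """Give every column name a unique positional suffix, e.g. cust_id -> cust_id__3.

    Returns a list of (unique_name, original_name) tuples in column order.
    """
    return [("%s__%d" % (c, i), c) for i, c in enumerate(cols)]


def first_occurrences(cols):
    """Keep only the first occurrence of each column name, preserving order."""
    seen = set()
    kept = []
    for unique_name, original in uniquify(cols):
        if original not in seen:
            seen.add(original)
            kept.append((unique_name, original))
    return kept
```

Applying this to the joined DataFrame would then look something like (again, a sketch, not verified):

    renamed = cust_base.toDF(*[u for u, _ in uniquify(cust_base.columns)])
    deduped = renamed.select(*[renamed[u].alias(c) for u, c in first_occurrences(cust_base.columns)])

after which `cust_id` should appear only once.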