
Trying to remove duplicate column names in a PySpark DataFrame after joining HDFS tables?

Hi, I am trying to join multiple datasets into a final result with 200+ columns. Because of the requirements and the high number of columns, I cannot select specific columns when joining. Is there a way to remove duplicate columns after the join? I know there is a way to do this with the .join method on Spark DataFrames, but the base tables I am joining are not Spark DataFrames, and I am trying to avoid converting them to Spark DataFrames before the join.

Original PySpark join query to create the Spark DF

cust_base = sqlc.sql('''
SELECT DISTINCT *
FROM db.tbl1 AS t1
LEFT JOIN db.tbl2 AS t2 ON (t1.acct_id = t2.acct_id)
LEFT JOIN db.tbl3 AS t3 ON (t1.cust_id = t3.cust_id)
WHERE t1.acct_subfam_mn IN ('PIA', 'PIM', 'IAA')
AND t1.active_acct_ct <> 0
AND t1.efectv_dt = '2018-10-31'
AND (t2.last_change_dt <= '2018-10-31' AND (t2.to_dt IS NULL OR t2.to_dt > '2018-10-31'))
AND (t3.last_change_dt <= '2018-10-31' AND (t3.to_dt IS NULL OR t3.to_dt > '2018-10-31'))
''')
cust_base.registerTempTable("df1")  # register separately so cust_base keeps the DataFrame
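
As an aside, a quick illustrative check that the registered table really ends up with repeated column names after the SELECT * join (assuming sqlc is the SQLContext used above):

import collections

# list every column name that appears more than once in the registered table
joined = sqlc.table("df1")
print([name for name, count in collections.Counter(joined.columns).items() if count > 1])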

Error while checking the distinct count of cust_id:

a = sqlc.sql('''
SELECT
count(DISTINCT cust_id) AS CT_ID
FROM df1
''')

AnalysisException: "Reference 'cust_id' is ambiguous, could be: cust_id#7L, 
cust_id#171L.; line 3 pos 15"

This is because the 'cust_id' field is present more than once due to the join.

I want to remove the duplicate columns from the resulting joined DataFrame. Thanks in advance.

vagautam
  • @vagautam - did you check the link below? I think this problem already has an answer: https://stackoverflow.com/questions/46944493/removing-duplicate-columns-after-a-df-join-in-spark – vikrant rana Dec 20 '18 at 01:46
  • @vikrantrana - yes, I checked the link already, but as I clearly state in my question, my initial join tables are not Spark DataFrames and I am trying to avoid converting them. I am joining using the PySpark SQL context. – vagautam Dec 21 '18 at 02:34
  • I am just thinking more on this.. never thought about this situation.. generally it never comes up if you code and select properly using aliases.. but I will still find a solution – vikrant rana Dec 21 '18 at 14:23
  • @vikrantrana I agree, but the alias method is appropriate if you are dealing with only a few columns. However, if you are working with tables with 100+ columns, it would be time consuming. For now I have converted the tables to Spark DataFrames and renamed the columns using the df.toDF method (a sketch of this workaround follows below). However, as stated in my original post, I was looking to avoid this method – vagautam Dec 21 '18 at 18:40
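
For completeness, a minimal sketch of the toDF workaround mentioned in the last comment: convert the joined result to a DataFrame whose column names are all unique before registering it. The renaming scheme and the table name "df1_renamed" are assumptions for illustration, not the asker's exact code.

# Illustrative only: make every column name unique by appending its position,
# then register the renamed DataFrame under a new temp table name.
cust_base_renamed = cust_base.toDF(
    *['{}_{}'.format(c, i) for i, c in enumerate(cust_base.columns)]
)
cust_base_renamed.registerTempTable("df1_renamed")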

1 Answer


I can help write a function to find the duplicate columns in a given DataFrame.

Let's say the DataFrame below has duplicate columns:

+------+----------------+----------+------+----------------+----------+
|emp_id|emp_joining_date|emp_salary|emp_id|emp_joining_date|emp_salary|
+------+----------------+----------+------+----------------+----------+
|     3|      2018-12-06|     92000|     3|      2018-12-06|     92000|
+------+----------------+----------+------+----------------+----------+

def finddups(*args):
    import collections
    dupes = []
    for cols in args:
        # collect every column name that appears more than once
        for item, count in collections.Counter(cols).items():
            if count > 1:
                dupes.append(item)
    return dupes

>>> duplicatecols = finddups(df.columns)
>>> print(duplicatecols)
['emp_id', 'emp_joining_date', 'emp_salary']
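
The function above only identifies the duplicated names; it does not drop anything. As a follow-up, one possible way to keep just the first occurrence of each repeated column is to rename the columns positionally with toDF and then select the original names. This helper and its suffix scheme are an illustrative sketch, not part of the original answer.

def drop_duplicate_columns(df):
    # Give repeated names a unique temporary suffix, then keep only the
    # first occurrence of each original name.
    counts = {}
    tmp_names = []
    keep = []
    for name in df.columns:
        if name in counts:
            counts[name] += 1
            tmp_names.append('{}__dup{}'.format(name, counts[name]))
        else:
            counts[name] = 0
            tmp_names.append(name)
            keep.append(name)
    return df.toDF(*tmp_names).select(*keep)

>>> drop_duplicate_columns(df).columns
['emp_id', 'emp_joining_date', 'emp_salary']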
vikrant rana
  • Can you check whether it will retain the columns which are not duplicated? – Raghavendra Gupta Dec 21 '18 at 14:38
  • @RaghavendraGupta the above suggested code is not dropping columns, only identifying column names that repeat more than once in the df. Hence, all the columns are still in the df... – vagautam Dec 21 '18 at 18:28