0

I need to return the boolean `False` if my input dataframe has duplicate columns with the same name. I wrote the code below. It identifies the duplicate columns in the input dataframe and returns them as a list. But when I call this function it must return a boolean value, i.e., if my input dataframe has duplicate columns with the same name, it must return `False`.

@udf('string')
def get_duplicates_cols(df, df_cols):
    duplicate_col_index = list(set([df_cols.index(c) for c in df_cols if df_cols.count(c) == 2]))
    for i in duplicate_col_index:
      df_cols[i] = df_cols[i] + '_duplicated'
      df2 = df.toDF(*df_cols)
    cols_to_remove = [c for c in df_cols if '_duplicated' in c]
    return cols_to_remove
duplicate_cols = udf(get_duplicates_cols,BooleanType())
ZygD
Ravali

2 Answers

2

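You don't need a UDF; you simply need a plain Python function. A UDF operates on column values, whereas `df.columns` is an ordinary Python list available on the driver, so the check runs in Python, not in the JVM. As @Santiago P said, you can use `checkDuplicate` on its own, without wrapping it in a UDF: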

    def checkDuplicate(df):
        return len(set(df.columns)) == len(df.columns) 
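To see the behaviour without a Spark session, here is a minimal sketch of the same check applied to a plain list of column names. The helper name `has_unique_columns` and the `case_sensitive` flag are my own additions for illustration, not part of PySpark; the flag is only there because Spark SQL resolves column names case-insensitively by default, so you may want `ID` and `id` to count as duplicates.

```python
def has_unique_columns(columns, case_sensitive=True):
    """Return True when no column name repeats, False otherwise."""
    names = list(columns)
    if not case_sensitive:
        # Assumption: treat "ID" and "id" as the same column name
        names = [name.lower() for name in names]
    # A set drops repeats, so the lengths differ only if there are duplicates
    return len(set(names)) == len(names)


# With a real DataFrame you would pass df.columns instead of a literal list
print(has_unique_columns(["id", "name", "age"]))               # True
print(has_unique_columns(["id", "name", "id"]))                # False
print(has_unique_columns(["ID", "id"], case_sensitive=False))  # False
```

Called as `has_unique_columns(df.columns)`, this should give the same result as `checkDuplicate(df)` above.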
ggeop
0

Assuming that you pass the data frame to the function.

    @udf(returnType=BooleanType())
    def checkDuplicate(df):
        return len(set(df.columns)) == len(df.columns)
Santiago P
  • It's not returning any value. It must return false if my input dataframe contains any duplicate columns. – Ravali Jan 07 '20 at 15:50
  • This solution gives you what you asked for: `False` if there are duplicates, `True` if there are no duplicate columns. I think you are mixing things up. In your code, you are trying to return a list of the names of the duplicate columns (`ArrayType(StringType())`). You can't return different types depending on the result (in a UDF). `df.columns` gives you a list of the columns in the dataframe; if you want the duplicate values, I suggest reading this [post](https://stackoverflow.com/questions/9835762/how-do-i-find-the-duplicates-in-a-list-and-create-another-list-with-them) – Santiago P Jan 07 '20 at 19:08
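If you also want the names of the duplicated columns, as the last comment suggests, one possible approach is to count the names with `collections.Counter`. This is only a sketch: the helper name `duplicate_columns` is made up for illustration, and it works on any list of strings such as the one returned by `df.columns`.

```python
from collections import Counter


def duplicate_columns(columns):
    """Return the column names that appear more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name, count in counts.items() if count > 1]


# With a real DataFrame you would pass df.columns instead of a literal list
cols = ["id", "name", "id", "age", "name"]
print(duplicate_columns(cols))        # ['id', 'name']
print(not duplicate_columns(cols))    # False, because duplicates exist
```

Keeping the list and the boolean as two separate results avoids the problem of a single function returning different types.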