The only workaround I found after hours of research is to rename the whole column set, then create another DataFrame with the new set as the header.
For example, if you have:
>>> import pyspark
>>> from pyspark.sql import SQLContext
>>>
>>> sc = pyspark.SparkContext()
>>> sqlContext = SQLContext(sc)
>>> df = sqlContext.createDataFrame([(1, 2, 3), (4, 5, 6)], ['a', 'b', 'a'])
>>> df
DataFrame[a: bigint, b: bigint, a: bigint]
>>> df.columns
['a', 'b', 'a']
>>> df2 = df.toDF('a', 'b', 'c')
>>> df2.columns
['a', 'b', 'c']
You can get the list of columns with df.columns, then loop over it to rename the duplicates and build the new column list. Don't forget to pass *new_col_list (unpacked) rather than new_col_list itself to toDF, or it will throw an invalid argument count error.
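The renaming loop itself is plain Python; one possible sketch (the suffix scheme `a`, `a_1`, `a_2`, … is my own choice here, not anything PySpark prescribes):

```python
def dedup_columns(cols):
    # Append a numeric suffix to every repeated name so the
    # resulting list contains only unique names:
    # ['a', 'b', 'a'] -> ['a', 'b', 'a_1']
    seen = {}
    new_cols = []
    for c in cols:
        if c in seen:
            seen[c] += 1
            new_cols.append("%s_%d" % (c, seen[c]))
        else:
            seen[c] = 0
            new_cols.append(c)
    return new_cols

# With a Spark DataFrame, note the * unpacking:
# df2 = df.toDF(*dedup_columns(df.columns))
```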