I am new to programming and am cleaning up and simplifying my code that performs a groupby and aggregation on a PySpark DataFrame. While refactoring to make things easier to follow, the following code raises an error:
TypeError: Invalid argument, not a string or column:
Here is my code:
from pyspark.sql import functions as F

groupBy = ['ColA']
convert_to_list = ['Col1', 'Col2', 'Col3',]
convert_to_set = ['Col4', 'Col5', 'Col6',]
fun_list = [F.collect_list]
funs_set = [F.collect_set]
exprs = F.concat(
[f(F.col(c)) for f in fun_list for c in convert_to_list],
[f(F.col(c)) for f in funs_set for c in convert_to_set]
)
df = df.groupby(*groupBy).agg(*exprs)
I am unsure how to pass the right column expressions to the agg function. Any help is really appreciated.
Sample input and expected output