I have a set of columns in my input data on which I am pivoting based on multiple columns.
I am facing issues with the column headers after the pivoting is done.
Input data
Output Generated by my approach -
Expected Output Headers:
I need the headers of the output to look like -
Steps done so far to achieve the Output I am getting -
// *Load the data*
scala> val input_data =spark.read.option("header","true").option("inferschema","true").option("delimiter","\t").csv("s3://mybucket/data.tsv")
// *Filter the data where residentFlag column = T*
scala> val filtered_data = input_data.select("numericID","age","salary","gender","residentFlag").filter($"residentFlag".contains("T"))
// *Now we will the pivot the filtered data by each column*
scala> val pivotByAge = filtered_data.groupBy("age","numericID").pivot("age").agg(expr("coalesce(first(numericID),'-')")).drop("age")
// *Pivot the data by the second column named "salary"*
scala> val pivotBySalary = filtered_data.groupBy("salary","numericID").pivot("salary").agg(expr("coalesce(first(numericID),'-')")).drop("salary")
// *Join the above two dataframes based on the numericID*
scala> val intermediateDf = pivotByAge.join(pivotBySalary,"numericID")
// *Now pivot the filtered data on Step 2 on the third column named Gender*
scala> val pivotByGender = filtered_data.groupBy("gender","numericID").pivot("gender").agg(expr("coalesce(first(numericID),'-')")).drop("gender")
// *Join the above dataframe with the intermediateDf*
scala> val outputDF= pivotByGender.join(intermediateDf ,"numericID")
How to rename the columns generated after pivoting?
Is there a different approach I can take for Pivoting the data set based on multiple columns (nearly 300)?
Any optimizations/suggestions for improving performance?