You need to explode the arrays in a nested manner and select the struct elements into separate columns. For that you need the explode built-in function, the select API, and aliasing.

Code to try:
import org.apache.spark.sql.functions._

val finalDF = df
  // explode the outer array: one row per therapeutic area
  .withColumn("trialTherapeuticAreas", explode(col("trialTherapeuticAreas")))
  // pull out the struct fields and explode the nested diseases array
  .select(
    col("trialTherapeuticAreas.id").as("trialTherapeuticAreas_ID"),
    col("trialTherapeuticAreas.name").as("trialTherapeuticAreas_name"),
    explode(col("trialTherapeuticAreas.trialDiseases")).as("trialDiseases"))
  // same again, one level deeper: explode the patient segments array
  .select(
    col("trialTherapeuticAreas_ID"),
    col("trialTherapeuticAreas_name"),
    col("trialDiseases.id").as("trialDiseases_id"),
    col("trialDiseases.name").as("trialDiseases_name"),
    explode(col("trialDiseases.trialPatientSegments")).as("trialPatientSegments"))
  // final projection: flatten the last struct into plain columns
  .select(
    col("trialTherapeuticAreas_ID"),
    col("trialTherapeuticAreas_name"),
    col("trialDiseases_id"),
    col("trialDiseases_name"),
    col("trialPatientSegments.id").as("trialPatientSegments_id"),
    col("trialPatientSegments.name").as("trialPatientSegments_name"))
That should give you the output you need.
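If you want to sanity-check this locally, here is a minimal sketch of a df with the nested schema implied by the column paths above (the case-class names and sample values are made up, and a SparkSession named spark is assumed). Define it before running the transformation:

import spark.implicits._

// Hypothetical case classes mirroring the assumed nested schema
case class PatientSegment(id: String, name: String)
case class Disease(id: String, name: String, trialPatientSegments: Seq[PatientSegment])
case class TherapeuticArea(id: String, name: String, trialDiseases: Seq[Disease])

// One row whose single column is an array of therapeutic-area structs
val df = Seq(Tuple1(Seq(
  TherapeuticArea("TA1", "Oncology", Seq(
    Disease("D1", "Lung Cancer", Seq(
      PatientSegment("PS1", "Stage I"),
      PatientSegment("PS2", "Stage II")))))))
).toDF("trialTherapeuticAreas")

With this input, finalDF contains one fully flattened row per patient segment, i.e. two rows.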
You can also do the above transformation with three withColumn calls and a single select statement:
import org.apache.spark.sql.functions._

val finalDF = df
  .withColumn("trialTherapeuticAreas", explode(col("trialTherapeuticAreas"))) // one row per therapeutic area
  .withColumn("trialDiseases", explode(col("trialTherapeuticAreas.trialDiseases"))) // one row per disease
  .withColumn("trialPatientSegments", explode(col("trialDiseases.trialPatientSegments"))) // one row per patient segment
  .select(
    col("trialTherapeuticAreas.id").as("trialTherapeuticAreas_ID"),
    col("trialTherapeuticAreas.name").as("trialTherapeuticAreas_name"),
    col("trialDiseases.id").as("trialDiseases_id"),
    col("trialDiseases.name").as("trialDiseases_name"),
    col("trialPatientSegments.id").as("trialPatientSegments_id"),
    col("trialPatientSegments.name").as("trialPatientSegments_name"))
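Either way, you can verify that every struct field ended up as a plain top-level column (the string types below come from the sample df sketched earlier; with your real data they will match your source schema):

finalDF.printSchema()
// root
//  |-- trialTherapeuticAreas_ID: string (nullable = true)
//  |-- trialTherapeuticAreas_name: string (nullable = true)
//  |-- trialDiseases_id: string (nullable = true)
//  |-- trialDiseases_name: string (nullable = true)
//  |-- trialPatientSegments_id: string (nullable = true)
//  |-- trialPatientSegments_name: string (nullable = true)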
One caveat: chaining many withColumn calls is not recommended for very large datasets or deeply nested schemas. Each withColumn introduces a projection internally, so long chains of calls build up a big query plan that can slow down analysis and optimization. Where possible, prefer a single select, as in the first version.
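You can see the effect by inspecting the plan. As a rough illustration (the column names here are made up), many chained withColumn calls stack up Project nodes in the analyzed plan; Catalyst collapses adjacent projections during optimization, but analyzing a very deep plan still takes time:

import org.apache.spark.sql.functions.lit

// 50 chained withColumn calls -> 50 stacked Project nodes before optimization
val wide = (1 to 50).foldLeft(df)((acc, i) => acc.withColumn(s"c$i", lit(i)))
wide.explain(extended = true) // compare the analyzed plan with the optimized one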