1

How read the multi nested JSON data in Spark. I have JSON file with Json Schema

I need to extract this schema format to TherapeuticArea line item like as below:

trialTherapeuticAreas_ID,trialTherapeuticAreas_name,trialDiseases_id,trialDiseases_name,trialPatientSegments_id,trialPatientSegments_name
zero323
  • 322,348
  • 103
  • 959
  • 935
ahmed
  • 273
  • 2
  • 8
  • 16

1 Answers1

2

You need to explode the arrays in a nested manner and select the struct elements in separate columns For that you would need explode inbuilt function and select api and aliasing.

Code to try :

import org.apache.spark.sql.functions._
val finalDF = df.withColumn("trialTherapeuticAreas", explode(col("trialTherapeuticAreas")))
                                      .select(col("trialTherapeuticAreas.id").as("trialTherapeuticAreas_ID"), col("trialTherapeuticAreas.name").as("trialTherapeuticAreas_name"), explode(col("trialTherapeuticAreas.trialDiseases")).as("trialDiseases"))
                                      .select(col("trialTherapeuticAreas_ID"), col("trialTherapeuticAreas_name"), col("trialDiseases.id").as("trialDiseases_id"), col("trialDiseases.name").as("trialDiseases_name"), explode(col("trialDiseases.trialPatientSegments")).as("trialPatientSegments"))
                                      .select(col("trialTherapeuticAreas_ID"), col("trialTherapeuticAreas_name"), col("trialDiseases_id"), col("trialDiseases_name"), col("trialPatientSegments.id").as("trialPatientSegments_id"), col("trialPatientSegments.name").as("trialPatientSegments_name"))

You should get your requirements fulfilled

You can do the above transformation using three withColumn api and one select statement too.

import org.apache.spark.sql.functions._
val finalDF = df.withColumn("trialTherapeuticAreas", explode(col("trialTherapeuticAreas")))
                .withColumn("trialDiseases", explode(col("trialTherapeuticAreas.trialDiseases")))
                .withColumn("trialPatientSegments", explode(col("trialDiseases.trialPatientSegments")))
                .select(col("trialTherapeuticAreas.id").as("trialTherapeuticAreas_ID"), col("trialTherapeuticAreas.name").as("trialTherapeuticAreas_name"), col("trialDiseases.id").as("trialDiseases_id"), col("trialDiseases.name").as("trialDiseases_name"), col("trialPatientSegments.id").as("trialPatientSegments_id"), col("trialPatientSegments.name").as("trialPatientSegments_name"))

Consecutive use of withColumn is not recommended for huge dataset as it might give random output. The reason is that withColumn is distributed and order of execution is not proved to be followed in serial manner

Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97