How to read the multi nested JSON data in Spark

Question

How read the multi nested JSON data in Spark. I have JSON file with

I need to extract this schema format to TherapeuticArea line item like as below:

trialTherapeuticAreas_ID,trialTherapeuticAreas_name,trialDiseases_id,trialDiseases_name,trialPatientSegments_id,trialPatientSegments_name

`df.select($"trialTherapeuticAreas")`;would be a good start – OneCricketeer Feb 07 '18 at 12:26 — OneCricketeer, Feb 07 '18 at 12:26

Ramesh Maharjan · Accepted Answer · 2018-02-07T15:43:59.663

You need to explode the arrays in a nested manner and select the struct elements in separate columns For that you would need explode inbuilt function and select api and aliasing.

Code to try :

import org.apache.spark.sql.functions._
val finalDF = df.withColumn("trialTherapeuticAreas", explode(col("trialTherapeuticAreas")))
                                      .select(col("trialTherapeuticAreas.id").as("trialTherapeuticAreas_ID"), col("trialTherapeuticAreas.name").as("trialTherapeuticAreas_name"), explode(col("trialTherapeuticAreas.trialDiseases")).as("trialDiseases"))
                                      .select(col("trialTherapeuticAreas_ID"), col("trialTherapeuticAreas_name"), col("trialDiseases.id").as("trialDiseases_id"), col("trialDiseases.name").as("trialDiseases_name"), explode(col("trialDiseases.trialPatientSegments")).as("trialPatientSegments"))
                                      .select(col("trialTherapeuticAreas_ID"), col("trialTherapeuticAreas_name"), col("trialDiseases_id"), col("trialDiseases_name"), col("trialPatientSegments.id").as("trialPatientSegments_id"), col("trialPatientSegments.name").as("trialPatientSegments_name"))

You should get your requirements fulfilled

You can do the above transformation using three withColumn api and one select statement too.

import org.apache.spark.sql.functions._
val finalDF = df.withColumn("trialTherapeuticAreas", explode(col("trialTherapeuticAreas")))
                .withColumn("trialDiseases", explode(col("trialTherapeuticAreas.trialDiseases")))
                .withColumn("trialPatientSegments", explode(col("trialDiseases.trialPatientSegments")))
                .select(col("trialTherapeuticAreas.id").as("trialTherapeuticAreas_ID"), col("trialTherapeuticAreas.name").as("trialTherapeuticAreas_name"), col("trialDiseases.id").as("trialDiseases_id"), col("trialDiseases.name").as("trialDiseases_name"), col("trialPatientSegments.id").as("trialPatientSegments_id"), col("trialPatientSegments.name").as("trialPatientSegments_name"))

Consecutive use of withColumn is not recommended for huge dataset as it might give random output. The reason is that withColumn is distributed and order of execution is not proved to be followed in serial manner

How to read the multi nested JSON data in Spark

1 Answers1