Reading a nested JSON structure in PySpark

Question

I am new to PySpark. I am trying to read the values of one of the nested columns in my JSON data. Here is my JSON structure:

 |-- _index: string (nullable = true)
 |-- _score: string (nullable = true)
 |-- _source: struct (nullable = true)
 |    |-- layers: struct (nullable = true)
 |    |    |-- R1.TEST6: struct (nullable = true)
 |    |    |    |-- R1.TEST1: struct (nullable = true)
 |    |    |    |    |-- R1.TEST1.idx: string (nullable = true)
 |    |    |    |    |-- R1.TEST1.ide: string (nullable = true)
 |    |    |    |-- R1.TEST3: struct (nullable = true)
 |    |    |    |    |-- R1.TEST3.PDU: string (nullable = true)
 |    |    |    |    |-- R1.TEST3.pdu: string (nullable = true)
 |    |    |    |    |-- R1.TEST4: struct (nullable = true)
 |    |    |    |    |    |-- R1.TEST2: struct (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.agg: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.size: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.start: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.beam: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.startIndex: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.regType: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.coreSetType: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.cpType: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column1: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column1: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column1: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column1: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column1: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column1: string (nullable = true)
 |    |    |    |    |    |    |-- R1.TEST2.column3: string (nullable = true)

Following the approach described in https://stackoverflow.com/questions/57811415/reading-a-nested-json-file-in-pyspark, I tried the following:

df2 = df.select(F.array(F.expr("_source.*")).alias("Source"))

Now I need to access the values under the R1.TEST6 tag.

But the code below is not working:

df2.withColumn("source_data", F.explode(F.arrays_zip("Source"))).select("source_data.Source.R1.TEST6.R1.TEST1.idx").show()

Can someone please help me access all the fields of this nested JSON and create a table from them? There are multiple levels of nesting under _source, in particular below R1.TEST6, so how do I use explode across that many levels?

Try ``df.select("_source.layers.`R1.TEST6`.`R1.TEST1`.*").show()`` - blackbishop, Feb 25 '21 at 20:06
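
To expand on the comment above: Spark splits a column path on dots, so a field named `R1.TEST6` has to be wrapped in backticks or Spark will look for a field called `R1`. Also, `explode` applies to array and map columns, and the schema in the question contains only structs, so selecting the nested fields directly should be enough. Here is a rough sketch of a helper that walks a schema and builds backtick-quoted paths for every leaf field. The helper names (`quote`, `leaf_columns`) and the trimmed `sample_schema` are made up for illustration. The dict layout is assumed to match what `df.schema.jsonValue()` returns, and field names are assumed to contain no backticks.

```python
from typing import Any, Dict, List, Tuple


def quote(name: str) -> str:
    """Wrap one field name in backticks so Spark does not split it on dots."""
    return "`" + name + "`"


def leaf_columns(schema: Dict[str, Any],
                 parents: Tuple[str, ...] = ()) -> List[Tuple[str, str]]:
    """Return (quoted_path, alias) for every non-struct field under a struct schema."""
    out: List[Tuple[str, str]] = []
    for field in schema["fields"]:
        path = parents + (field["name"],)
        dtype = field["type"]
        if isinstance(dtype, dict) and dtype.get("type") == "struct":
            # Nested struct: recurse one level deeper.
            out.extend(leaf_columns(dtype, path))
        else:
            quoted = ".".join(quote(p) for p in path)
            # The alias replaces dots so the flat column is easy to reference later.
            out.append((quoted, field["name"].replace(".", "_")))
    return out


def _leaf(name: str) -> Dict[str, Any]:
    """Hypothetical shorthand for a nullable string field in the sample below."""
    return {"name": name, "type": "string", "nullable": True, "metadata": {}}


def _struct(name: str, fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical shorthand for a nullable struct field in the sample below."""
    return {"name": name, "type": {"type": "struct", "fields": fields},
            "nullable": True, "metadata": {}}


# A trimmed stand-in for the schema in the question (illustration only).
sample_schema = {
    "type": "struct",
    "fields": [
        _leaf("_index"),
        _struct("_source", [
            _struct("layers", [
                _struct("R1.TEST6", [
                    _struct("R1.TEST1", [
                        _leaf("R1.TEST1.idx"),
                        _leaf("R1.TEST1.ide"),
                    ]),
                ]),
            ]),
        ]),
    ],
}

columns = leaf_columns(sample_schema)
for path, alias in columns:
    print(alias, "<-", path)

# With a real DataFrame `df` (assumes pyspark is installed and the JSON is loaded):
#   from pyspark.sql import functions as F
#   flat = df.select([F.col(p).alias(a) for p, a in leaf_columns(df.schema.jsonValue())])
#   flat.show()
```

One caveat: the alias is built from the leaf name only, so two leaves with the same name in different structs (or repeated names such as `R1.TEST2.column1` in the schema above) would collide and need a different aliasing rule.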
