I've been having trouble for a few days trying to resolve this.

I have a nested JSON file with a complex schema (arrays inside structs, structs inside arrays), and I need to load the data into a DataFrame.
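
For context, the DataFrame is loaded roughly like this (the file name and the multiLine option are assumptions, since the original loading code isn't shown):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "input.json" is a hypothetical path; multiLine is assumed because the records are nested
df = spark.read.option("multiLine", True).json("input.json")
df.printSchema()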

What I have as input is this (as an example):

+-----+----------------+-----------------------------------+---------+
| id  | name           | detail                            | item    |
+-----+----------------+-----------------------------------+---------+
| 100 | Peter Castle   | [[D100A, Credit],[D100B, Debit]]  | [10,31] |
| 101 | Quino Yukimori | [[D101A, Credit],[D101B, Credit]] | [55,49] |
+-----+----------------+-----------------------------------+---------+

It should be read like this:

+-----+----------------+-----------+--------+-----------+
| id  | name           | detail_id | type   | item_qty  |
+-----+----------------+-----------+--------+-----------+
| 100 | Peter Castle   | D100A     | Credit | 10        |
| 100 | Peter Castle   | D100B     | Debit  | 31        |
| 101 | Quino Yukimori | D101A     | Credit | 55        |
| 101 | Quino Yukimori | D101B     | Credit | 49        |
+-----+----------------+-----------+--------+-----------+

But what I get is this:


# exploding each array separately cross-joins the two arrays, duplicating rows
df.withColumn('detail', explode('detail')).withColumn('item', explode('item'))

+-----+----------------+-----------+--------+-----------+
| id  | name           | detail_id | type   |  item_qty |
+-----+----------------+-----------+--------+-----------+
| 100 | Peter Castle   | D100A     | Credit | 10        |
| 100 | Peter Castle   | D100A     | Debit  | 10        |
| 100 | Peter Castle   | D100B     | Credit | 31        |
| 100 | Peter Castle   | D100B     | Debit  | 31        |
| 101 | Quino Yukimori | D101A     | Credit | 55        |
| 101 | Quino Yukimori | D101A     | Credit | 55        |
| 101 | Quino Yukimori | D101B     | Credit | 49        |
| 101 | Quino Yukimori | D101B     | Credit | 49        |
+-----+----------------+-----------+--------+-----------+

I have tried combining the columns with arrays_zip and then exploding, but the problem is that there are arrays inside arrays: if I explode the detail array column, exploding the item array column afterwards multiplies the data.

Any idea how I can implement this?

Sorry about my English; it is not my native language.

UPDATED

Here is my schema; the multiple nested arrays make it complicated to read:

 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- detail: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- detail_id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- item_qty: long (nullable = true)
 |-- deliveryTrack: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- track: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- time: string (nullable = true)
 |    |    |    |    |-- driver: string (nullable = true)
Wesley Romero
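
For anyone who wants to reproduce the question's sample data without the original file, here is a minimal sketch (values reconstructed from the tables above; deliveryTrack is omitted because no sample values are shown for it):

# Build the sample DataFrame by hand with a DDL schema string
data = [
    ("100", "Peter Castle",
     [("D100A", "Credit"), ("D100B", "Debit")],
     [(10,), (31,)]),
    ("101", "Quino Yukimori",
     [("D101A", "Credit"), ("D101B", "Credit")],
     [(55,), (49,)]),
]
df = spark.createDataFrame(
    data,
    "id string, name string, "
    "detail array<struct<detail_id: string, type: string>>, "
    "item array<struct<item_qty: long>>"
)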

1 Answer


Use explode only once, after you zip both arrays with arrays_zip. After that, use the expr function to get the data out of the zipped structs.

from pyspark.sql.functions import explode, arrays_zip, col, expr

df1 = (df
      # zip the two arrays element-wise, then explode once so the rows stay aligned
      .withColumn('buffer', explode(arrays_zip(col('detail'), col('item'))))
      # pull each field out of the zipped struct
      .withColumn('detail_id', expr("buffer.detail.detail_id"))
      .withColumn('type', expr("buffer.detail.type"))
      .withColumn('item_qty', expr("buffer.item.item_qty"))
      .drop('detail', 'item', 'buffer')
    )
df1.show()

+---+--------------+---------+------+--------+
|id |name          |detail_id|type  |item_qty|
+---+--------------+---------+------+--------+
|100|Peter Castle  |D100A    |Credit|10      |
|100|Peter Castle  |D100B    |Debit |31      |
|101|Quino Yukimori|D101A    |Credit|55      |
|101|Quino Yukimori|D101B    |Credit|49      |
+---+--------------+---------+------+--------+
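
One caveat worth noting: arrays_zip (available since Spark 2.4) pairs elements strictly by position, so this assumes detail and item always have the same length; if one array is shorter, the missing slots in the zipped structs come back as null.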
AdibP
  • Hello, and thanks for helping me. I've tried your solution but get this error: `AnalysisException: cannot resolve 'element_at(`buffer`.`detail`, 1)' due to data type mismatch: The first argument to function element_at should have been array or map type, but its struct...` – Wesley Romero Jul 21 '21 at 16:07
  • I've updated my question adding the schema for better understanding – Wesley Romero Jul 21 '21 at 16:59
  • Tried to replicate your schema, I've edited my answer. – AdibP Jul 21 '21 at 19:03
  • Hey, it works! I'm so thankful. But what can I do if I need to get the time and driver columns, which are inside the deliveryTrack array? Could you help me with that? The same problem occurs with every additional array I have to read. However, you answered my original question, which is why I marked it as answered. – Wesley Romero Jul 21 '21 at 20:48
  • You can include the column `deliveryTrack` inside `arrays_zip`, then access the data later using a nested column name like `"buffer.deliveryTrack.track.driver"` – AdibP Jul 21 '21 at 22:26
  • I've tried that, but got this error: `AnalysisException: cannot resolve 'buffer'.'deliveryTrack'.'track'['driver'] due to data type mismatch: argument 2 requires integral type, however, 'driver' is of string type` – Wesley Romero Jul 21 '21 at 23:29
  • Does `buffer.deliveryTrack.track[0].driver` work? Anyway, you can ask a new question if needed, or read this thread to get into the technical details of nested schemas: https://stackoverflow.com/questions/28332494/querying-spark-sql-dataframe-with-complex-types – AdibP Jul 22 '21 at 00:54 (see the sketch after these comments for one way to flatten deliveryTrack)
  • I'll try that. Thanks for your help! – Wesley Romero Jul 22 '21 at 13:46
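
Following up on the comment thread, here is a minimal sketch of how deliveryTrack could be flattened with the same explode pattern (an assumption based on the schema in the question, not code from the answer). Because track is nested inside deliveryTrack rather than being a parallel array, exploding twice is correct here and does not cause the cross-join problem from the question:

from pyspark.sql.functions import explode, col

df2 = (df
      # one row per deliveryTrack element
      .withColumn('dt', explode('deliveryTrack'))
      # one row per track element inside each deliveryTrack entry
      .withColumn('tr', explode(col('dt.track')))
      .select('id', 'name',
              col('dt.date').alias('date'),
              col('tr.time').alias('time'),
              col('tr.driver').alias('driver'))
    )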