
I have a JSON file stored in a blob container with this structure:

{
   "tables":[
      {
         "name":"PrimaryResult",
         "columns":[
            {
               "name":"TenantId",
               "type":"string"
            },
            {
               "name":"TimeGenerated",
               "type":"datetime"
            },
            {
               "name":"CorrelationId",
               "type":"string"
            },
            {
               "name":"UserName",
               "type":"string"
            }
         ],
         "rows":[
            [
               "1341234",
               "5143534",
               "14314123",
               "test@test.cloud"
            ]
         ]
      }
   ]
}

When I try to read it, the result is not what I expect.

I want to read the file and have the table structure created for me dynamically.

Desired output:

TenantId  TimeGenerated  CorrelationId  UserName
1341234   5143534        14314123       test@test.cloud

Can anyone please help me achieve this?

Thank you!


1 Answer


You need to set the multiline option to true for multi-line JSON; refer to this answer.

Here's the code to read your JSON and flatten it into multiple columns:

from pyspark.sql.functions import col, explode

df = spark.read.option("inferSchema", "true").option("multiline", "true").json("file.json")
# Explode tables, both levels of rows, then columns: one record per (column, value) pair
df.withColumn("tables", explode(col("tables"))).select("tables.*")\
    .withColumn("rows", explode(col("rows"))).withColumn("rows", explode(col("rows"))).withColumn("columns", explode(col("columns")))\
    .select("columns.*", "name", "rows").show()

Result:

+-------------+--------+-------------+---------------+
|         name|    type|         name|           rows|
+-------------+--------+-------------+---------------+
|     TenantId|  string|PrimaryResult|        1341234|
|TimeGenerated|datetime|PrimaryResult|        1341234|
|CorrelationId|  string|PrimaryResult|        1341234|
|     UserName|  string|PrimaryResult|        1341234|
|     TenantId|  string|PrimaryResult|        5143534|
|TimeGenerated|datetime|PrimaryResult|        5143534|
|CorrelationId|  string|PrimaryResult|        5143534|
|     UserName|  string|PrimaryResult|        5143534|
|     TenantId|  string|PrimaryResult|       14314123|
|TimeGenerated|datetime|PrimaryResult|       14314123|
|CorrelationId|  string|PrimaryResult|       14314123|
|     UserName|  string|PrimaryResult|       14314123|
|     TenantId|  string|PrimaryResult|test@test.cloud|
|TimeGenerated|datetime|PrimaryResult|test@test.cloud|
|CorrelationId|  string|PrimaryResult|test@test.cloud|
|     UserName|  string|PrimaryResult|test@test.cloud|
+-------------+--------+-------------+---------------+

UPDATE:

If you want to pivot your data, you can just call .pivot(col). Note that I applied .first() when pivoting, which picks an arbitrary value; if you want a specific value, you have to adjust your agg function:

from pyspark.sql.functions import col, explode, first, lit

df = spark.read.option("inferSchema", "true").option("multiline", "true").json("file.json")
df = df.withColumn("tables", explode(col("tables"))).select("tables.*") \
    .withColumn("rows", explode(col("rows"))).withColumn("rows", explode(col("rows"))).drop("name")\
    .withColumn("columns", explode(col("columns"))).select("columns.*", "rows").orderBy("name")
# Pivot on the column name; first() picks an arbitrary value per pivoted column
df.groupBy(lit(1)).pivot("name").agg(first(col("rows"))).drop("1").show()

Result:

+-------------+--------+-------------+--------+
|CorrelationId|TenantId|TimeGenerated|UserName|
+-------------+--------+-------------+--------+
|      1341234| 1341234|      1341234| 1341234|
+-------------+--------+-------------+--------+

UPDATE 2:

If you want to associate a value in an array column with the value at the same index in another array column, you can use the arrays_zip function:

from pyspark.sql.functions import arrays_zip, col, explode, first, lit

df = spark.read.option("inferSchema", "true").option("multiline", "true").json("file.json")
# arrays_zip pairs each value with the column struct at the same index
df = df.withColumn("tables", explode(col("tables"))).select("tables.*").withColumn("rows", explode(col("rows")))\
    .withColumn("tmp", explode(arrays_zip("columns", "rows"))).select("tmp.columns.name", "tmp.rows")
df.groupBy(lit(1)).pivot("name").agg(first(col("rows"))).drop("1").show()

Result:

+-------------+--------+-------------+---------------+
|CorrelationId|TenantId|TimeGenerated|       UserName|
+-------------+--------+-------------+---------------+
|     14314123| 1341234|      5143534|test@test.cloud|
+-------------+--------+-------------+---------------+
Abdennacer Lachiheb
  • Hi Abdennacer, thanks a lot for your help! I forgot to put the desired output, but I'll try to pick up your solution and apply the unpivot/pivot function. Once again, thanks a lot! – coding Apr 21 '23 at 13:02
  • @coding sorry, I didn't see the update; I updated my answer. – Abdennacer Lachiheb Apr 21 '23 at 14:20
  • No problem Abdennacer, thanks a lot :) It's very close to what I want, but if you notice, you are repeating the first value in all the columns, and I want each column to have its respective value, as in the desired output: TenantId TimeGenerated CorrelationId UserName 1341234 5143534 14314123 test@test.cloud – coding Apr 21 '23 at 14:47
  • Thanks a lot Abdennacer, you saved my life xD. This is exactly what I wanted :D. Just one question: in this example I have only one row of values, but what if the JSON file comes with a new row, e.g. [ "245252", "645424", "23424", "test_one@test.cloud" ]? The desired output would then be TenantId TimeGenerated CorrelationId UserName 1341234 5143534 14314123 test@test.cloud; 245252 645424 23424 test_one@test.cloud. How can I get all the new rows? I think it will only fetch the first one. – coding Apr 21 '23 at 15:29
  • @coding instead of lit(1) use col("col") where col is a unique column name; if you don't have a unique column, you can create one with the monotonically_increasing_id function from Spark (see the sketch below). – Abdennacer Lachiheb Apr 21 '23 at 17:18
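
A minimal sketch of what that last comment suggests, assuming the same file.json and Spark session as above (row_id is an arbitrary helper column name): tag each exploded row with monotonically_increasing_id and group by that id instead of lit(1), so every input row pivots into its own output row.

from pyspark.sql.functions import arrays_zip, col, explode, first, monotonically_increasing_id

df = spark.read.option("inferSchema", "true").option("multiline", "true").json("file.json")
df = df.withColumn("tables", explode(col("tables"))).select("tables.*")\
    .withColumn("rows", explode(col("rows")))\
    .withColumn("row_id", monotonically_increasing_id())\
    .withColumn("tmp", explode(arrays_zip("columns", "rows")))\
    .select("row_id", "tmp.columns.name", "tmp.rows")
# One output row per row_id; first() is unambiguous because each (row_id, name) pair is unique
df.groupBy("row_id").pivot("name").agg(first(col("rows"))).drop("row_id").show()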