
My input JSON file is

{
"Name": "Test",
"Mobile": 12345678,
"Boolean": true,
"Pets": ["Dog", "cat"],
"Address": {
  "Permanent address": "USA",
  "current Address": "AU"
  }
}

The requirement is to convert the above multi-level JSON into a DataFrame using PySpark.

I tried using the code

path_to_input = "/FileStore/tables/sample_json_file2-6c20f.json"
df = spark.read.json(sc.wholeTextFiles(path_to_input).values())
df.show()

I got the output as

+---------+-------+--------+----+----------+
|  Address|Boolean|  Mobile|Name|      Pets|
+---------+-------+--------+----+----------+
|[USA, AU]|   true|12345678|Test|[Dog, cat]|
+---------+-------+--------+----+----------+

In the Address and Pets fields I'm getting two values in the same column. Address shouldn't come out as a single struct-like column; it should be flattened into separate columns, i.e. Address_Permanent address as USA and Address_current Address as AU.

jose praveen
Ravali
    What have you tried so far? Can you post some code on the answer? – LeandroHumb Dec 30 '19 at 13:24
  • Does this answer your question? [Pyspark - converting json string to DataFrame](https://stackoverflow.com/questions/49675860/pyspark-converting-json-string-to-dataframe) – blackbishop Dec 30 '19 at 14:03
  • Does this answer your question? [JSON file parsing in Pyspark](https://stackoverflow.com/questions/41543903/json-file-parsing-in-pyspark) – Oliver W. Dec 31 '19 at 01:53

1 Answer


You can try something like below:

from pyspark.sql.types import StructType, StructField, StringType, BooleanType, LongType

schema_json = StructType([StructField("Address", StringType(), True),
                          StructField("Boolean", BooleanType(), True),
                          StructField("Mobile", LongType(), True),
                          StructField("Name", StringType(), True),
                          StructField("Pets", StringType(), True)])

df = spark.read.json(path="/FileStore/tables/sample_json_file2-6c20f.json", schema=schema_json)
df.show(truncate=False)

This will have an output like below:

+--------------------------------------------------+-------+--------+----+-------------+
|Address                                           |Boolean|Mobile  |Name|Pets         |
+--------------------------------------------------+-------+--------+----+-------------+
|{"Permanent address":"USA","current Address":"AU"}|true   |12345678|Test|["Dog","cat"]|
+--------------------------------------------------+-------+--------+----+-------------+

EDIT

If you want Permanent address and current Address as separate columns, you can do the following:

from pyspark.sql.functions import get_json_object

df = spark.read.json(path="/FileStore/tables/sample_json_file2-6c20f.json", schema=schema_json)\
    .select("Boolean", "Mobile", "Name", "Pets",
            get_json_object('Address', "$.Permanent address").alias('Permanent address'),
            get_json_object('Address', "$.current Address").alias('current Address'))
df.show(truncate=False)

Output:

+-------+--------+----+-------------+-----------------+---------------+
|Boolean|Mobile  |Name|Pets         |Permanent address|current Address|
+-------+--------+----+-------------+-----------------+---------------+
|true   |12345678|Test|["Dog","cat"]|USA              |AU             |
+-------+--------+----+-------------+-----------------+---------------+
Saurabh
  • Thank you, but we must get Address_Permanent address and Address_current Address as separate columns. – Ravali Dec 31 '19 at 10:39
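For reference, the Address_-prefixed flattening the comment asks for can be illustrated in plain Python, independent of Spark. The `flatten` helper below is a hypothetical sketch (not a Spark API): it walks nested dicts and joins child keys to their parent with an underscore, producing exactly the column names requested.

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts, joining keys with '_'."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing with the parent key
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

record = json.loads("""{
  "Name": "Test",
  "Mobile": 12345678,
  "Boolean": true,
  "Pets": ["Dog", "cat"],
  "Address": {"Permanent address": "USA", "current Address": "AU"}
}""")

print(flatten(record))
# {'Name': 'Test', 'Mobile': 12345678, 'Boolean': True,
#  'Pets': ['Dog', 'cat'],
#  'Address_Permanent address': 'USA', 'Address_current Address': 'AU'}
```

In Spark, the same renaming can be achieved by aliasing each extracted field (e.g. `.alias('Address_Permanent address')`) in the `select` from the answer above.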