
I'm trying to create a DataFrame from the following JSON structure -

{
    "tables" : {
        "name" : "PrimaryResult",
        "columns" : [
            {
                "name" : "EmployeeID",
                "type" : "Int"
            },
            {
                "name" : "EmployeeName",
                "type" : "String"
            },
            {
                "name" : "DepartmentName",
                "type" : "String"
            }
        ],
        "rows"  : [
            [
                123,
                "John Doe",
                "IT"
            ],
            [   
                234,
                "Jane Doe",
                "HR"
            ]
        ]
    }
}

I tried a few of the suggestions from How to create pandas DataFrame from nested Json with list and How to parse nested JSON objects in spark sql?, but I'm still confused. Essentially, the output should look something like this -

+----------+------------+--------------+
|EmployeeID|EmployeeName|DepartmentName|
+----------+------------+--------------+
|       123|    John Doe|            IT|
|       234|    Jane Doe|            HR|
+----------+------------+--------------+

I'm trying to refrain from using pandas, as it runs into a lot of memory issues when the data is huge (not sure if there is a way to handle them).

Please help.

1 Answer


See the logic below -

import json

# `js` holds the JSON from the question as a string
data = [json.loads(js)]

print(data)

# Output
[{'tables': {'name': 'PrimaryResult', 'columns': [{'name': 'EmployeeID', 'type': 'Int'}, {'name': 'EmployeeName', 'type': 'String'}, {'name': 'DepartmentName', 'type': 'String'}], 'rows': [[123, 'John Doe', 'IT'], [234, 'Jane Doe', 'HR']]}}]

Now fetch the columns as below -

columns = [col['name'] for col in data[0]['tables']['columns']]

print(columns)

# Output
['EmployeeID', 'EmployeeName', 'DepartmentName']

Create a dictionary of columns and rows as below -

dict_JSON = {}

dict_JSON["columns"] = columns
dict_JSON["data"] = data[0]['tables']['rows']

print(dict_JSON)

# Output
{'columns': ['EmployeeID', 'EmployeeName', 'DepartmentName'], 'data': [[123, 'John Doe', 'IT'], [234, 'Jane Doe', 'HR']]}

Now that you have this dictionary (which matches pandas' 'split' orientation), create a pandas DataFrame from it, and from there create the Spark DataFrame as below -

import pandas as pd

pdf = pd.read_json(json.dumps(dict_JSON), orient='split')

df = spark.createDataFrame(pdf)
df.show()

+----------+------------+--------------+
|EmployeeID|EmployeeName|DepartmentName|
+----------+------------+--------------+
|       123|    John Doe|            IT|
|       234|    Jane Doe|            HR|
+----------+------------+--------------+
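Since the question mentions wanting to avoid pandas for large data, here is a minimal sketch that skips the pandas round-trip entirely: it parses the column names and rows from the JSON with plain Python, and those two lists can be passed straight to spark.createDataFrame. (It assumes a SparkSession named spark is in scope, as in the answer above.)

```python
import json

# The JSON from the question, held as a string
js = '''{"tables": {"name": "PrimaryResult",
         "columns": [{"name": "EmployeeID", "type": "Int"},
                     {"name": "EmployeeName", "type": "String"},
                     {"name": "DepartmentName", "type": "String"}],
         "rows": [[123, "John Doe", "IT"], [234, "Jane Doe", "HR"]]}}'''

table = json.loads(js)['tables']
columns = [col['name'] for col in table['columns']]  # ['EmployeeID', ...]
rows = table['rows']                                 # [[123, 'John Doe', 'IT'], ...]

# With a SparkSession in scope, the list of rows and the column names
# can be handed directly to createDataFrame - no pandas needed:
#   df = spark.createDataFrame(rows, columns)
#   df.show()
```

This avoids materializing the data twice (once in pandas, once in Spark), which matters when the payload is large.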
Dipanjan Mallick