Generic code to flatten any complicated nested json file using pyspark/pandas

Question

I have a complicated nested JSON file. I need generic code that flattens this nested file and stores the result in a dataframe, using either PySpark or pandas. Is this achievable, and is there any generic code that works for any complicated nested JSON file?

Abhishek K · Answer 1 · 2022-09-02T13:16:20.260

In the example below the JSON is defined inline in the data variable. To load a JSON file instead, you can use

df = pd.read_json('data.json')
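Note that pd.read_json() returns a DataFrame directly, and nested objects stay as dict or list values inside its cells. If the goal is to pass the file's contents to json_normalize(), one option is to read the file with the standard json module first. A minimal sketch, assuming the file holds a list of records; the file contents and the names records, path and df_loaded are made up for illustration:

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical file contents, written to a temporary file so the sketch runs as-is.
records = [{"company": "Google", "management": {"CEO": "ABC"}}]
path = Path(tempfile.mkdtemp()) / "data.json"
path.write_text(json.dumps(records))

# Load the raw Python structure (a list of dicts), not a DataFrame.
with open(path) as f:
    data = json.load(f)

# json_normalize expands nested dicts into dotted column names.
df_loaded = pd.json_normalize(data)
print(df_loaded.columns.tolist())
```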

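I have used json_normalize() to flatten the nested JSON data.

A deeply nested JSON structure can be converted to a dataframe by passing the record path and meta arguments to json_normalize(), as shown below. Here "department" is the list that gets expanded into one row per element, and the meta list names the parent fields to repeat on each row. A nested parent field is given as a path, such as ["management", "CEO"].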

import pandas as pd
data = [
    {
        "company": "Google",
        "tagline": "Hello World",
        "management": {"CEO": "ABC"},
        "department": [
            {"name": "Gmail", "revenue (bn)": 123},
            {"name": "GCP", "revenue (bn)": 400},
            {"name": "Google drive", "revenue (bn)": 600},
        ],
    },
    {
        "company": "Microsoft",
        "tagline": "This is text",
        "management": {"CEO": "XYZ"},
        "department": [
            {"name": "Onedrive", "revenue (bn)": 13},
            {"name": "Azure", "revenue (bn)": 300},
            {"name": "Microsoft 365", "revenue (bn)": 300},
        ],
    },
]
df = pd.json_normalize(
    data,
    record_path="department",  # list to expand into one row per element
    meta=["company", "tagline", ["management", "CEO"]],  # parent fields repeated on each row
)

df

Output

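            name  revenue (bn)    company       tagline management.CEO
0          Gmail           123     Google   Hello World            ABC
1            GCP           400     Google   Hello World            ABC
2   Google drive           600     Google   Hello World            ABC
3       Onedrive            13  Microsoft  This is text            XYZ
4          Azure           300  Microsoft  This is text            XYZ
5  Microsoft 365           300  Microsoft  This is text            XYZ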
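The json_normalize() call above needs the record path and meta fields spelled out for each file, so it is not fully generic. For a file whose structure is not known in advance, one option is a small recursive helper. The following is only a sketch under stated assumptions: flatten_record and flatten_json are hypothetical helper names (not part of pandas), nested dicts become dotted column names, lists of dicts are expanded into rows, and lists of scalars are left as single cell values. If one record contains several lists of dicts, the rows are their Cartesian product, which can grow quickly.

```python
import pandas as pd


def flatten_record(record, parent_key="", sep="."):
    """Return a list of flat dicts (rows) for one nested dict (hypothetical helper)."""
    rows = [{}]
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Nested object: recurse, prefixing keys with the parent path.
            sub_rows = flatten_record(value, full_key, sep)
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            # Non-empty list of objects: every element contributes its own row(s).
            sub_rows = [r for item in value for r in flatten_record(item, full_key, sep)]
        else:
            # Scalars, empty lists and lists of scalars are kept as a single cell.
            sub_rows = [{full_key: value}]
        # Combine the rows built so far with the rows from this key.
        rows = [{**left, **right} for left in rows for right in sub_rows]
    return rows


def flatten_json(data):
    """Flatten a dict or a list of dicts into a DataFrame (hypothetical helper)."""
    records = data if isinstance(data, list) else [data]
    return pd.DataFrame([row for rec in records for row in flatten_record(rec)])


# Made-up sample data for illustration.
sample = [
    {
        "id": 1,
        "owner": {"name": "a", "address": {"city": "X"}},
        "items": [{"sku": "p", "qty": 2}, {"sku": "q", "qty": 1}],
    },
    {
        "id": 2,
        "owner": {"name": "b", "address": {"city": "Y"}},
        "items": [{"sku": "r", "qty": 5}],
    },
]
flat = flatten_json(sample)
print(flat)
```

This gives one row per element of items, with the parent fields id, owner.name and owner.address.city repeated on each row. For very large files the row-by-row recursion may be slow, in which case json_normalize() with an explicit record path is likely the better choice.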
