1

I have this super nested json file which needs to be in a flat form. Previously I had a similar problem for XML which i solved with the below simple code.

df = pdx.read_xml('C:\\python_script\\temp1\\'+file,encoding='utf-8')
df = pdx.fully_flatten(df)
df = df.pipe(flatten)

Looking for a similar simple code to do the work.

Here is the data. https://www.donneesquebec.ca/recherche/dataset/d23b2e02-085d-43e5-9e6e-e1d558ebfdd5/resource/eb4d7620-6aa3-4850-aab6-a0fbe82f2dc1/download/hebdo_20211227_20220102.json

Any help will be appreciated. :)

  • 1
    what's the shape of the output you want from here? –  Jan 09 '22 at 09:11
  • I need flat sheet with repeated items. for example for each ocid, there are multiple parties which are contained in json, so i need each party with repeated ocids in separate columns. in short, each json value in a separate column. Let me know if you need further clarification. – Jibran Karim Jan 09 '22 at 11:08

2 Answers2

1

You can use json_normalize() which is pretty effective. This article explains it very well for different scenarios: https://towardsdatascience.com/all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd

Here some examples.

some_dict = {
    'key1': 'value1',
    'key2': 'value2',
    'key3': 123,
}
df = pd.json_normalize(some_dict)

WIth multiple levels, if you don't want all of the levels, you can use: pd.json_normalize(data, max_level=1)

With nested lists you can use meta to specify a list of to include:

json_object = {
    'key1': 'value1',
    'key2': 'value2',

    'key3': {
        'key3_1': 'value3',
        'key3_2': {
          'key3_2_1': {
              'admission': 'value4',
              'general': 'value5'
          },
          'key3_3': 'value6',
      }
    },
    'key4': [
      { 'key4_1': 'value7' },
      { 'key4_2': 'value8' },
      { 'key4_3': 'value9' }
    ],
}

# you can do:
pd.json_normalize(
    json_object, 
    record_path =['key4'],
    meta=['key1', ['key3', 'key3_2', 'key3_3']],
)

If you have lists where not all keys are always present, you can use errors='ignore'.

pd.json_normalize(
    json_object, 
    record_path =['key4'],
    meta=['key1', ['key3', 'key3_2', 'key3_3']],
    errors='ignore'
)

By default, nested values will be separated with ., you can change this with sep='':

pd.json_normalize(
    json_object, 
    record_path =['key4'],
    meta=['key1', ['key3', 'key3_2', 'key3_3']],
    errors='ignore',
    sep='-'
)

It depends on where you get your JSON data, from local file or URL. For local file:

import json
# load data using Python JSON module
with open('data/simple.json','r') as f:
    data = json.loads(f.read())
    
# Flattening JSON data
pd.json_normalize(data)

For URLs:

import requestsURL = 'http://raw.githubusercontent.com/BindiChen/machine-learning/master/data-analysis/027-pandas-convert-json/data/simple.json'
data = json.loads(requests.get(URL).text)# Flattening JSON data
pd.json_normalize(data)
timsntech
  • 206
  • 1
  • 6
0

After spending almost 2 days, here is the most simplest solution i could have.

with open('D:\\Json Data.json') as json_data:
    data = json.load(json_data)
dic_flattened = [flatten(d) for d in data['releases']]     
df = pd.DataFrame(dic_flattened)

It goes to the lowest level and make separated columns. The below article helped me. https://pypi.org/project/flatten-json/