Accessing nested data in JSON file to build multiple dataframes

Question

I'm currently accessing a JSON file to gather information about companies, branches and workers. In this JSON there are multiple companies, in which every company has its own branches and each branch has its own workers. Below there's an example of the JSON structure I'm working on.

{
  "company": [
    {
      "companyName": "MyCompany",
      "branches": [
        {
          "branchName": "My First Branch",
          "workers": [
            {
              "workerName": "John",
              "wage": "10"
            },
            {
              "workerName": "Lucas",
              "wage": "20"
            }
          ]
        },
        {
          "branchName": "My Second Branch",
          "workers": [
            {
              "workerName": "Mary",
              "wage": "30"
            },
            {
              "workerName": "Jack",
              "wage": "40"
            }
          ]
        }
      ]
    },
    {
      "companyName": "YourCompany",
      "branches": [
        {
          "branchName": "Your First Branch",
          "workers": [
            {
              "workerName": "George",
              "wage": "15"
            },
            {
              "workerName": "Harry",
              "wage": "25"
            }
          ]
        },
        {
          "branchName": "Your Second Branch",
          "workers": [
            {
              "workerName": "Wayne",
              "wage": "35"
            },
            {
              "workerName": "Rose",
              "wage": "45"
            }
          ]
        }
      ]
    }
  ]
}

My goal is to gather informations regarding all companies, branches and workers in one dataframe for each of them. For that, I'm currently using loops, as shown below.

Companies = my_json['Companies'] #accessing list of companies
Branches = None
Workers = None
for i in range(len(Companies)):
  company_branches = Companies[i]['Branches'] #accessing branches for that company
  if(Branches is None):
    Branches = pd.DataFrame(company_branches)
  else:
    Branches = pd.concat([Branches,pd.DataFrame(company_branches)])
  for j in range(len(company_branches)):
    branch_workers = company_branches[j]['Workers'] #accessing workers for that branch
    if(Workers is None):
      Workers = pd.DataFrame(branch_workers)
    else:
      Workers = pd.concat([Workers,pd.DataFrame(branch_workers)])
Companies = pd.DataFrame(Companies)

This solves my problem, creating the three desired dataframes (with some extra columns for companies and branches which I would still drop), but I'm currently having performance issues. I've been trying to solve this problem without loops, but I can't create the dataframes correctly. If I try

Companies = pd.DataFrame(my_json['companies'])

it creates the dataframe correctly, but then if I try

Branches = pd.DataFrame(Companies.branches.values)

it isn't created correctly. It doesn't raise an error, but it's basically copying the column from the Companies dataframe, creating a single column with the JSON code for the branches for each company. In this case, what I wanted was as many columns as the amount of attributes the branches have.

Any tips on how to solve this problem in an efficient way?

score 1 · Accepted Answer · answered Jan 20 '20 at 18:47

You should probably keep all of your data in one dataframe (tidy data). If you need, you could then get a dataframe for companies or branches, e.g. pd.DataFrame(df.company.unique(), columns='company name').

The method below uses a nested list comprehension to flatten your data. It also converts the wage record from a string to a float.

df = pd.DataFrame(
    [(company.get('companyName'), branch.get('branchName'), 
      worker.get('workerName'), float(worker.get('wage', 0))) 
     for company in my_json['company']
     for branch in company['branches'] 
     for worker in branch.get('workers')
], columns=['company', 'branch', 'worker', 'wage'])

>>> df
       company              branch  worker  wage
0    MyCompany     My First Branch    John  10.0
1    MyCompany     My First Branch   Lucas  20.0
2    MyCompany    My Second Branch    Mary  30.0
3    MyCompany    My Second Branch    Jack  40.0
4  YourCompany   Your First Branch  George  15.0
5  YourCompany   Your First Branch   Harry  25.0
6  YourCompany  Your Second Branch   Wayne  35.0
7  YourCompany  Your Second Branch    Rose  45.0

score 0 · Answer 2 · answered Jan 20 '20 at 18:42

This is my solution to the problem in a generalized way. Given I define the whole json your provide as data, then:

to_df = {'companyName':[],'branchName':[],'workerName':[],'wage':[]}
for i in range(len(data['company'])):
    for j in range(len(data['company'][i]['branches'])):
        for k in range(len(data['company'][i]['branches'][j]['workers'])):
            to_df['companyName'].append(data['company'][i]['companyName'])
            to_df['branchName'].append(data['company'][i]['branches'][j]['branchName'])
            to_df['workerName'].append(data['company'][i]['branches'][j]['workers'][k]['workerName'])
            to_df['wage'].append(data['company'][i]['branches'][j]['workers'][k]['wage'])
df = pd.DataFrame(to_df)
print(df)

Output:

   companyName          branchName workerName wage
0    MyCompany     My First Branch       John   10
1    MyCompany     My First Branch      Lucas   20
2    MyCompany    My Second Branch       Mary   30
3    MyCompany    My Second Branch       Jack   40
4  YourCompany   Your First Branch     George   15
5  YourCompany   Your First Branch      Harry   25
6  YourCompany  Your Second Branch      Wayne   35
7  YourCompany  Your Second Branch       Rose   45

You can check this answer for some extra information regarding nested jsons: How to extract nested JSON data?

Accessing nested data in JSON file to build multiple dataframes

2 Answers2