10

I am trying to flatten a JSON file that looks like this:

{
"teams": [
  {
    "teamname": "1",
    "members": [
      {
        "firstname": "John", 
        "lastname": "Doe",
        "orgname": "Anon",
        "phone": "916-555-1234",
        "mobile": "",
        "email": "john.doe@wildlife.net"
      },
      {
        "firstname": "Jane",
        "lastname": "Doe",
        "orgname": "Anon",
        "phone": "916-555-4321",
        "mobile": "916-555-7890",
        "email": "jane.doe@wildlife.net"
      }
    ]
  },
  {
    "teamname": "2",
    "members": [
      {
        "firstname": "Mickey",
        "lastname": "Moose",
        "orgname": "Moosers",
        "phone": "916-555-0000",
        "mobile": "916-555-1111",
        "email": "mickey.moose@wildlife.net"
      },
      {
        "firstname": "Minny",
        "lastname": "Moose",
        "orgname": "Moosers",
        "phone": "916-555-2222",
        "mobile": "",
        "email": "minny.moose@wildlife.net"
      }
    ]
  }       
]

}

I wish to export this to an excel table. My current code is this:

from pandas.io.json import json_normalize
import json
import pandas as pd

inputFile = 'E:\\teams.json'
outputFile = 'E:\\teams.xlsx'

f = open(inputFile)
data = json.load(f)
f.close()

df = pd.DataFrame(data)

result1 = json_normalize(data, 'teams' )
print result1

results in this output:

members                                              teamname
0  [{u'firstname': u'John', u'phone': u'916-555-...        1
1  [{u'firstname': u'Mickey', u'phone': u'916-555-...      2

There are 2 members's data nested within each row. I would like to have an output table that displays all 4 members' data plus their associated teamname.

piRSquared
  • 285,575
  • 57
  • 475
  • 624
spaine
  • 443
  • 1
  • 5
  • 17

2 Answers2

15

Use pandas.io.json.json_normalize

json_normalize(data,record_path=['teams','members'],meta=[['teams','teamname']])

output:
         email                firstname lastname mobile      orgname    phone       teams.teamname
0   john.doe@wildlife.net       John    Doe                   Anon      916-555-1234    1
1   jane.doe@wildlife.net       Jane    Doe     916-555-7890  Anon      916-555-4321    1
2   mickey.moose@wildlife.net   Mickey  Moose   916-555-1111  Moosers   916-555-0000    2
3   minny.moose@wildlife.net    Minny   Moose                 Moosers   916-555-2222    2


Explanation

from pandas.io.json import json_normalize
import pandas as pd

I've only learned how to use the json_normalize function recently so my explanation might not be right.

Start with what I'm calling 'Layer 0'

json_normalize(data)

output:
     teams
0   [{'teamname': '1', 'members': [{'firstname': '...

There is 1 Column and 1 Row. Everything is inside the 'team' column.

Look into what I'm calling 'Layer 1' by using record_path=

json_normalize(data,record_path='teams')

output:
     members                                          teamname
0   [{'firstname': 'John', 'lastname': 'Doe', 'org...    1
1   [{'firstname': 'Mickey', 'lastname': 'Moose', ...    2

In Layer 1 we have have flattened 'teamname' but there is more inside 'members'.

Look into Layer 2 with record_path=. The notation is unintuitive at first. I now remember it by ['layer','deeperlayer'] where the result is layer.deeperlayer.

json_normalize(data,record_path=['teams','members'])

output:
           email              firstname lastname   mobile     orgname   phone
0   john.doe@wildlife.net      John        Doe                  Anon    916-555-1234
1   jane.doe@wildlife.net       Jane        Doe   916-555-7890  Anon    916-555-4321
2   mickey.moose@wildlife.net   Mickey     Moose   916-555-1111 Moosers 916-555-0000
3   minny.moose@wildlife.net    Minny       Moose               Moosers 916-555-2222

Excuse my output, I don't know how to make tables in a response.

Finally we add in Layer 1 columns using meta=

json_normalize(data,record_path=['teams','members'],meta=[['teams','teamname']])

output:
         email                firstname lastname mobile      orgname    phone       teams.teamname
0   john.doe@wildlife.net       John    Doe                   Anon      916-555-1234    1
1   jane.doe@wildlife.net       Jane    Doe     916-555-7890  Anon      916-555-4321    1
2   mickey.moose@wildlife.net   Mickey  Moose   916-555-1111  Moosers   916-555-0000    2
3   minny.moose@wildlife.net    Minny   Moose                 Moosers   916-555-2222    2

Notice how we needed a list of lists for meta=[[]] to reference Layer 1. If there was a column we want from Layer 0 and Layer 1 we could do this:

json_normalize(data,record_path=['layer1','layer2'],meta=['layer0',['layer0','layer1']])

The result of the json_normalize is a pandas dataframe.

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Jason Chiu
  • 151
  • 1
  • 3
11

This is one way to do it. Should give you some ideas.

df = pd.concat(
    [
        pd.concat([pd.Series(m) for m in t['members']], axis=1) for t in data['teams']
    ], keys=[t['teamname'] for t in data['teams']]
)

                                     0                         1
1 email          john.doe@wildlife.net     jane.doe@wildlife.net
  firstname                       John                      Jane
  lastname                         Doe                       Doe
  mobile                                            916-555-7890
  orgname                         Anon                      Anon
  phone                   916-555-1234              916-555-4321
2 email      mickey.moose@wildlife.net  minny.moose@wildlife.net
  firstname                     Mickey                     Minny
  lastname                       Moose                     Moose
  mobile                  916-555-1111                          
  orgname                      Moosers                   Moosers
  phone                   916-555-0000              916-555-2222

To get a nice table with team name and members as rows, all attributes in columns:

df.index.levels[0].name = 'teamname'
df.columns.name = 'member'

df.T.stack(0).swaplevel(0, 1).sort_index()

enter image description here

To get team name and member as actual columns, just reset the index.

df.index.levels[0].name = 'teamname'
df.columns.name = 'member'

df.T.stack(0).swaplevel(0, 1).sort_index().reset_index()

enter image description here

The whole thing

import json
import pandas as pd

json_text = """{
"teams": [
  {
    "teamname": "1",
    "members": [
      {
        "firstname": "John", 
        "lastname": "Doe",
        "orgname": "Anon",
        "phone": "916-555-1234",
        "mobile": "",
        "email": "john.doe@wildlife.net"
      },
      {
        "firstname": "Jane",
        "lastname": "Doe",
        "orgname": "Anon",
        "phone": "916-555-4321",
        "mobile": "916-555-7890",
        "email": "jane.doe@wildlife.net"
      }
    ]
  },
  {
    "teamname": "2",
    "members": [
      {
        "firstname": "Mickey",
        "lastname": "Moose",
        "orgname": "Moosers",
        "phone": "916-555-0000",
        "mobile": "916-555-1111",
        "email": "mickey.moose@wildlife.net"
      },
      {
        "firstname": "Minny",
        "lastname": "Moose",
        "orgname": "Moosers",
        "phone": "916-555-2222",
        "mobile": "",
        "email": "minny.moose@wildlife.net"
      }
    ]
  }       
]
}"""


data = json.loads(json_text)

df = pd.concat(
    [
        pd.concat([pd.Series(m) for m in t['members']], axis=1) for t in data['teams']
    ], keys=[t['teamname'] for t in data['teams']]
)

df.index.levels[0].name = 'teamname'
df.columns.name = 'member'

df.T.stack(0).swaplevel(0, 1).sort_index().reset_index()
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Thank you, but what I am trying to achieve is a table that has names, email, etc as column headings with the team number as an additional column heading. That way the four people are represented by four flattened records. – spaine Jun 07 '16 at 15:35
  • This was intended to show you the sort of techniques that can be used to accomplish your goal. To get what you want, you now need to pivot. I'll update my answer to reflect this. – piRSquared Jun 07 '16 at 16:06
  • The example you've created looks perfect. When I run this the following error gets returned at the df.index.levels[0].name = 'teamname' line: AttributeError: 'Int64Index' object has no attribute 'levels' – spaine Jun 07 '16 at 19:14
  • That's my bad. I edited the post. I forgot to assign the `pd.concat` to the dataframe `df` – piRSquared Jun 07 '16 at 19:17
  • This is very helpful to me - I'm going to have to do some research to understand exactly how your code works. May I ask how you produced the data in table form that you posted above? – spaine Jun 07 '16 at 19:46
  • This solution is very specific to the json structure you provided. The html table is output while using jupyter-notebook. You can access the same html via df.to_html(). – piRSquared Jun 07 '16 at 19:48
  • Hmm, when I include either or both of the last 2 scripts to turn it into a table, it remains as you show after pd.concat... – spaine Jun 07 '16 at 22:05
  • Ok, I was basing my code on an assumption that your data was the same as mine. I'll update the code again with the exact thing I ran. – piRSquared Jun 07 '16 at 22:50
  • I was using the same data, just loading from a .json file. Even if I copy the entire text of the code into a python editor, printing the result displays what you show after pd.concat... – spaine Jun 08 '16 at 16:09
  • Are you looking at `df` or `df.T.stack(0).swaplevel(0, 1).sort_index().reset_index()`? The latter doesn't reassign to `df`. So if you look at `df` after you've run `df.T.stack(0).swaplevel(0, 1).sort_index().reset_index()`, `df` will still look like it did after the `pd.concat` – piRSquared Jun 08 '16 at 16:14
  • That was it, I thought df had been reassigned. I assigned df.T.stack(0).swaplevel(0, 1).sort_index().reset_index() as an object and the export to Excel worked perfectly. Thank you very much. – spaine Jun 08 '16 at 16:31
  • @piRSquared Amazing solution. Hopefully, I can make it work when, for example, the phone has many entries e.g. ```"phone": [{"home": "1111", "mobile": "2222"},{"home": "3333", "mobile": "44444"}``` – Datacrawler Jul 29 '19 at 14:56