How can metadata be normalized with the json_normalize function?

Question

I have nested JSON and would like to convert it to a pandas DataFrame using the json_normalize function.

JSON

json_input = [{'measurements': [{'value': 111, 'timestamp': 1},
                                {'value': 222, 'timestamp': 2}],
               'sensor': {'name': 'testsensor',
                          'id': 1}},
              {'measurements': [{'value': 333, 'timestamp': 1},
                                {'value': 444, 'timestamp': 2}],
               'sensor': None},
              ]

Normalizing

import pandas as pd

df = pd.json_normalize(json_input, record_path=['measurements'],
                       meta=['sensor'])

The metadata does not get normalized in the output of the above code:

|   | value | timestamp | sensor                          |
|---|-------|-----------|---------------------------------|
| 0 | 111   | 1         | {'name': 'testsensor', 'id': 1} |
| 1 | 222   | 2         | {'name': 'testsensor', 'id': 1} |
| 2 | 333   | 1         | None                            |
| 3 | 444   | 2         | None                            |

Is there a way to get the following desired output?

|   | value | timestamp | sensor.name  | sensor.id |
|---|-------|-----------|--------------|-----------|
| 0 | 111   | 1         | 'testsensor' | 1         |
| 1 | 222   | 2         | 'testsensor' | 1         |
| 2 | 333   | 1         | None         | None      |
| 3 | 444   | 2         | None         | None      |

jezrael · Accepted Answer · 2020-05-28T07:39:02.393


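Build a DataFrame from the sensor column with the DataFrame constructor, replacing missing values (empty lists or None, depending on the pandas version) with empty dicts, then join it to the original with concat: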

df = pd.json_normalize(json_input, record_path=['measurements'],
                                   meta=['sensor'])

# A missing sensor shows up as [] in pandas 1.0.1 and as None in pandas 1.0.3,
# so treat both as an empty dict. pop() removes the column, so call it only once.
df1 = pd.DataFrame([{} if x is None or x == [] else x for x in df.pop('sensor')]).add_prefix("sensor.")

df = pd.concat([df, df1], axis=1)
print(df)
   value  timestamp sensor.name  sensor.id
0    111          1  testsensor        1.0
1    222          2  testsensor        1.0
2    333          1         NaN        NaN
3    444          2         NaN        NaN
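
For reference, the same idea can be written as a self-contained sketch that does not depend on how a given pandas version represents a missing sensor. The helper name `expand_meta` is hypothetical (it is not part of pandas), and it assumes that any non-dict value in the column (None, an empty list, NaN) means "no metadata":

```python
import pandas as pd

json_input = [
    {"measurements": [{"value": 111, "timestamp": 1},
                      {"value": 222, "timestamp": 2}],
     "sensor": {"name": "testsensor", "id": 1}},
    {"measurements": [{"value": 333, "timestamp": 1},
                      {"value": 444, "timestamp": 2}],
     "sensor": None},
]


def expand_meta(df, column, sep="."):
    """Hypothetical helper: expand a column of dicts into prefixed columns."""
    # Assumption: anything that is not a dict counts as missing metadata.
    records = [x if isinstance(x, dict) else {} for x in df[column]]
    expanded = pd.DataFrame(records, index=df.index).add_prefix(f"{column}{sep}")
    # Drop the original dict column and attach the expanded columns.
    return pd.concat([df.drop(columns=column), expanded], axis=1)


df = pd.json_normalize(json_input, record_path=["measurements"], meta=["sensor"])
result = expand_meta(df, "sensor")
print(result)
```

Passing `index=df.index` keeps the expanded columns aligned with the original rows even if the frame does not have a default integer index.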

edited May 28 '20 at 07:39

answered May 28 '20 at 06:49


Thanks for the answer! Works fine. I will just evaluate the comment that you did on @Pygirls answer – elyptikus May 28 '20 at 07:03
It's not the main requirement. But I will check it out! – elyptikus May 28 '20 at 07:05
@elyptikus - super, let me know. – jezrael May 28 '20 at 07:05
@jezrael: This doesn't work in my case. It gives :`AttributeError: 'NoneType' object has no attribute 'keys'` Because there are none values also. I am using `pandas 1.0.3` – Pygirl May 28 '20 at 07:17
It should be `if x == None` Then it works fine for me. – Pygirl May 28 '20 at 07:20
I am using version 1.0.3. I have updated my answer: `pd.DataFrame([i if i!=None else {} for i in df['sensor'].tolist()])` This will be faster than apply, right? – Pygirl May 28 '20 at 07:38

Pygirl · Answer 2 · 2020-05-28T07:38:11.023

This will do it: df['sensor'].apply(pd.Series).add_prefix("sensor.")

df = pd.json_normalize(json_input, record_path=['measurements'],
                                   meta=['sensor'])
df = pd.concat([df, df['sensor'].apply(pd.Series).add_prefix("sensor.")], axis=1)
df.drop('sensor', inplace=True, axis=1)
df

    value   timestamp   sensor.name sensor.id
0   111     1           testsensor  1.0
1   222     2           testsensor  1.0
2   333     1           NaN         NaN
3   444     2           NaN         NaN

As jezrael mentioned, .apply(pd.Series) is slow, so you can use this instead:

pd.DataFrame([i if i is not None else {} for i in df['sensor'].tolist()])

df = pd.json_normalize(json_input, record_path=['measurements'],
                                   meta=['sensor'])
df = pd.concat([df, pd.DataFrame([i if i is not None else {} for i in df['sensor'].tolist()]
                                 ).add_prefix("sensor.")], axis=1)
df.drop('sensor', inplace=True, axis=1)
df
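
A further variant, offered as a sketch and not taken from either answer, keeps the expansion inside json_normalize by listing nested meta paths. It assumes that the sensor keys (`name`, `id`) are known in advance and that a missing sensor is None. The None is replaced with an empty dict first, because `errors="ignore"` only covers missing keys (KeyError) and would not help with indexing into None:

```python
import pandas as pd

json_input = [
    {"measurements": [{"value": 111, "timestamp": 1},
                      {"value": 222, "timestamp": 2}],
     "sensor": {"name": "testsensor", "id": 1}},
    {"measurements": [{"value": 333, "timestamp": 1},
                      {"value": 444, "timestamp": 2}],
     "sensor": None},
]

# Assumption: a missing sensor is None (or otherwise falsy); swap it for {}
# so the nested lookup fails with KeyError, which errors="ignore" maps to NaN.
cleaned = [{**rec, "sensor": rec.get("sensor") or {}} for rec in json_input]

df = pd.json_normalize(
    cleaned,
    record_path=["measurements"],
    # Each inner list is a path into the record; the column name is the
    # path joined with the default separator ".".
    meta=[["sensor", "name"], ["sensor", "id"]],
    errors="ignore",
)
print(df)
```

The trade-off is that every metadata key has to be spelled out by hand, whereas the constructor-based versions above pick up whatever keys the dicts contain.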

I think `apply(pd.Series)` should be avoided, because slow, [link](https://stackoverflow.com/a/35491399/2901002) — jezrael, May 28 '20 at 06:51
Thanks for the answer @Pygirl! I tested it and it works fine for me. — elyptikus, May 28 '20 at 07:41
