
I have a dataframe df that loads data from a database. Most of the columns are JSON strings, and some are even lists of JSON objects. For example:

id     name     columnA                               columnB
1     John     {"dist": "600", "time": "0:12.10"}    [{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]
2     Mike     {"dist": "600"}                       [{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "total", "value": "800"}]
...

As you can see, not all the rows have the same number of elements in the JSON strings for a given column.

What I need to do is keep the normal columns like id and name as they are, and flatten the JSON columns like so:

id    name   columnA.dist   columnA.time   columnB.pos.1st   columnB.pos.2nd   columnB.pos.3rd   columnB.pos.total
1     John   600            0:12.10        500               300               200               1000
2     Mike   600            NaN            500               300               NaN               800

I have tried using json_normalize like so:

from pandas.io.json import json_normalize
json_normalize(df)

But this raises a KeyError. What is the correct way of doing this?

asked by sfactor (edited by petezurich)
  • What about the values in columnB? You want to flatten the dictionaries also? – MMF Oct 06 '16 at 14:44
  • Yes, they need to be flattened as well. There was a typo in the original question where I put columnA for all the flattened columns, but I've corrected it now. – sfactor Oct 06 '16 at 14:51

4 Answers


Here's a solution using json_normalize() again, with custom functions to get the data into the format that json_normalize expects.

import ast
from pandas.io.json import json_normalize

def only_dict(d):
    '''
    Convert json string representation of dictionary to a python dict
    '''
    return ast.literal_eval(d)

def list_of_dicts(ld):
    '''
    Convert a string representation of a list of dicts
    into a single dict, pairing up the two values of each entry
    '''
    return dict([(list(d.values())[1], list(d.values())[0]) for d in ast.literal_eval(ld)])

A = json_normalize(df['columnA'].apply(only_dict).tolist()).add_prefix('columnA.')
B = json_normalize(df['columnB'].apply(list_of_dicts).tolist()).add_prefix('columnB.pos.') 

Finally, join the DFs on the common index to get:

df[['id', 'name']].join([A, B])

(screenshot of the resulting flattened dataframe)


EDIT: As per the comment by @MartijnPieters, the recommended way of decoding the JSON strings is json.loads(), which is much faster than ast.literal_eval() if you know that the data source is JSON.
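Putting that together, here's a sketch of the same pipeline using json.loads() on a trimmed version of the question's sample data. It also keys each columnB entry by its "pos" field explicitly, instead of relying on the iteration order of dict.values():

```python
import json
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'name': ['John', 'Mike'],
    'columnA': ['{"dist": "600", "time": "0:12.10"}',
                '{"dist": "600"}'],
    'columnB': ['[{"pos": "1st", "value": "500"}, {"pos": "total", "value": "1000"}]',
                '[{"pos": "1st", "value": "500"}, {"pos": "total", "value": "800"}]'],
})

# Decode the JSON strings with json.loads instead of ast.literal_eval
A = pd.json_normalize(df['columnA'].apply(json.loads).tolist()).add_prefix('columnA.')

# Key each list entry by its "pos" field explicitly
B = pd.json_normalize(
    df['columnB'].apply(lambda s: {d['pos']: d['value'] for d in json.loads(s)}).tolist()
).add_prefix('columnB.pos.')

flat = df[['id', 'name']].join([A, B])
```

(pd.json_normalize is the top-level import on pandas >= 1.0; on the older versions current when this answer was written, it lived at pandas.io.json.json_normalize.)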

– Nickil Maveli
    Great thanks for the answer ! one thing though, is the returned lists on the list_of_dicts (list(d.values())[0], list(d.values())[1]), and not the other way round? Otherwise this worked perfect for me. – sfactor Oct 07 '16 at 06:17
    As you would know that `dictionaries` do not preserve the order while performing iteration, the values present in the `dict` were appearing in the order opposite to that of yours and hence was the need to use the slicing notation differently compared to yours. If it's appearing in the same order as you've mentioned, go ahead with it or you can even make use of an [`Ordered Dict`](https://docs.python.org/3/library/collections.html#collections.OrderedDict) to preserve the order if you want to. – Nickil Maveli Oct 07 '16 at 08:22
    Why the (slow!) `ast.literal_eval()` call when you should be using `json.loads()`? The latter handles correct JSON data, the former only *Python* syntax, which *differs materially* when it comes to booleans, nulls and unicode data outside of the BMP. – Martijn Pieters Nov 11 '17 at 21:37
  • @MartijnPieters: Thanks for the comment. I've updated my post. – Nickil Maveli Nov 12 '17 at 08:25
  • Not only is it faster, it'll also avoid `ValueError` exceptions when `true`, `false` or `null` values are involved. JSON is not Python. – Martijn Pieters Nov 13 '17 at 13:17
  • @NickilMaveli how do I use this if I have `nan` value in column A, I am getting an error `malformed node or string: nan` when trying to use the `only_dict` function, but when I remove the 'nan' values from the column A it is fine. Thank you – Durga Gaddam Apr 09 '18 at 21:49
    If your data contains null, you can update the `only_dict` method to: `return ast.literal_eval(d) if pd.notnull(d) else {}` Otherwise, it returns `ValueError: malformed node or string: nan` – E. Zeytinci Nov 18 '19 at 21:03

The quickest seems to be:

import pandas as pd
import json

json_struct = json.loads(df.to_json(orient="records"))
df_flat = pd.json_normalize(json_struct)  # on older pandas, use pd.io.json.json_normalize
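For example, on a frame whose cells already hold Python dicts (a minimal sketch; note that list-of-dict columns like the question's columnB survive the round trip as lists and still need exploding separately):

```python
import json
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'columnA': [{"dist": "600", "time": "0:12.10"}, {"dist": "600"}],
})

# Round-trip through JSON, then let json_normalize flatten the nested dicts
json_struct = json.loads(df.to_json(orient="records"))
df_flat = pd.json_normalize(json_struct)
# df_flat columns: id, columnA.dist, columnA.time
```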
– staonas (edited by LucyDrops)

TL;DR Copy-paste the following function and use it like this: flatten_nested_json_df(df)

This is the most general function I could come up with:

def flatten_nested_json_df(df):
    
    df = df.reset_index()
    
    print(f"original shape: {df.shape}")
    print(f"original columns: {df.columns}")
    
    
    # search for columns to explode/flatten
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()
    
    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()
    
    print(f"lists: {list_columns}, dicts: {dict_columns}")
    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []
        
        for col in dict_columns:
            print(f"flattening: {col}")
            # explode dictionaries horizontally, adding new columns
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns) # inplace
        
        for col in list_columns:
            print(f"exploding: {col}")
            # explode lists vertically, adding new columns
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            # Prevent combinatorial explosion when multiple
            # cols have lists or lists of lists
            df = df.reset_index(drop=True)
            new_columns.append(col)
        
        # check if there are still dict or list fields to flatten
        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()
        
        print(f"lists: {list_columns}, dicts: {dict_columns}")
        
    print(f"final shape: {df.shape}")
    print(f"final columns: {df.columns}")
    return df

It takes a dataframe that may have nested lists and/or dicts in its columns, and recursively explodes/flattens those columns.

It uses pandas' pd.json_normalize to explode the dictionaries (creating new columns), and pandas' explode to explode the lists (creating new rows).
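The two building blocks can be seen in isolation (a minimal sketch with toy data):

```python
import pandas as pd

df = pd.DataFrame({
    'a': [{'x': 1, 'y': 2}, {'x': 3}],   # dict column
    'b': [[1, 2], [3]],                  # list column
})

# Dicts flatten horizontally into new columns
wide = pd.json_normalize(df['a'].tolist()).add_prefix('a.')

# Lists explode vertically into new rows
tall = df['b'].explode()
```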

Simple to use:

# Test
df = pd.DataFrame(
    columns=['id','name','columnA','columnB'],
    data=[
        [1,'John',{"dist": "600", "time": "0:12.10"},[{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]],
        [2,'Mike',{"dist": "600"},[{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "total", "value": "800"}]]
    ])

flatten_nested_json_df(df)

It's not the most efficient thing on earth, and it has the side effect of resetting your dataframe's index, but it gets the job done. Feel free to tweak it.

– Michele Piccolini (edited by CarlosK)
    This is BY FAR the best solution I have seen in a long time! Good job! – Serge de Gosson de Varennes May 31 '21 at 15:06
  • Hi there, this is helpful, but it doesn't seem to save the new dataframe – Cameron Stewart Nov 05 '21 at 09:40
  • @CameronStewart save where? – Michele Piccolini Nov 08 '21 at 20:18
  • It gives me error `msg.format(req_len=len(left.columns), given_len=len(right)) ValueError: Unable to coerce to Series, length must be 44: given 1` – Atharv Thakur Jun 17 '22 at 06:52
  • By far the best ever solution on this problem...thumbs up – John Sep 28 '22 at 11:04
  • This can be improved to allow subtypes of `dict` and `list` (e.g. an `OrderedDict`) in the columns search with for example: ```s = df.applymap(lambda e: isinstance(e, dict)).all() ``` – badgerm Oct 18 '22 at 14:01
  • This is a great solution one issue I found is when a key value (for any one record) in a multi-record json file is null. When this happens, the type function doesn't return the correct dict / list type resulting in the column being ignored. Not sure if there is there a way for the type function to account for the nulls in this situation – KahlilG Jan 12 '23 at 03:04

Create a custom function to flatten columnB, then use pd.concat:

def flatten(js):
    return pd.DataFrame(js).set_index('pos').squeeze()

pd.concat([df.drop(['columnA', 'columnB'], axis=1),
           df.columnA.apply(pd.Series),
           df.columnB.apply(flatten)], axis=1)
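Note this assumes the cells already hold Python dicts/lists rather than JSON strings (decode with json.loads first if not). What flatten does to a single cell, as a quick sketch:

```python
import pandas as pd

def flatten(js):
    # list of {"pos": ..., "value": ...} dicts -> Series indexed by pos
    return pd.DataFrame(js).set_index('pos').squeeze()

cell = [{"pos": "1st", "value": "500"}, {"pos": "total", "value": "800"}]
s = flatten(cell)
# s is a Series with index ['1st', 'total'] and the values as data
```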


– piRSquared
  • This was the only solution that worked for me in a situation with a columnB-type column. Look at the elegance of the code! One line. – Diego May 18 '23 at 13:17