
I have a dataframe df that loads data from a database. Most of the columns are JSON strings, and some are even lists of JSON objects. For example:

id     name     columnA                               columnB
1     John     {"dist": "600", "time": "0:12.10"}    [{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]
2     Mike     {"dist": "600"}                       [{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "total", "value": "800"}]
...

As you can see, not all the rows have the same number of elements in the JSON strings for a given column.

What I need to do is keep the normal columns like id and name as they are, and flatten the JSON columns like so:

id    name   columnA.dist   columnA.time   columnB.pos.1st   columnB.pos.2nd   columnB.pos.3rd   columnB.pos.total
1     John   600            0:12.10        500               300               200               1000
2     Mike   600            NaN            500               300               NaN               800

I have tried using json_normalize like so:

from pandas.io.json import json_normalize
json_normalize(df)

But this raises a KeyError. What is the correct way of doing this?

asked by sfactor (edited by petezurich)
  • What about the values in columnB? You want to flatten the dictionaries also? – MMF Oct 06 '16 at 14:44
  • Yes, they need to be flattened as well. There was a typo in the original question where I put columnA for all the flattened columns, but I've corrected it now. – sfactor Oct 06 '16 at 14:51

4 Answers


Here's a solution using json_normalize() again, with custom functions to get the data into the format that json_normalize expects.

import ast
from pandas.io.json import json_normalize

def only_dict(d):
    '''
    Convert json string representation of dictionary to a python dict
    '''
    return ast.literal_eval(d)

def list_of_dicts(ld):
    '''
    Convert a string representation of a list of dicts
    into a single dict, pairing up the two values of each entry
    '''
    return dict([(list(d.values())[1], list(d.values())[0]) for d in ast.literal_eval(ld)])

A = json_normalize(df['columnA'].apply(only_dict).tolist()).add_prefix('columnA.')
B = json_normalize(df['columnB'].apply(list_of_dicts).tolist()).add_prefix('columnB.pos.') 

Finally, join the DFs on the common index to get:

df[['id', 'name']].join([A, B])

(screenshot of the resulting flattened dataframe)


EDIT: As per the comment by @MartijnPieters, the recommended way of decoding the JSON strings is json.loads(), which is much faster than ast.literal_eval() if you know that the data source is JSON.
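Putting that together, here's a sketch of the same pipeline using json.loads() on a trimmed version of the question's sample data. It also keys each columnB entry by its "pos" field explicitly, instead of relying on the iteration order of dict.values():

```python
import json
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'name': ['John', 'Mike'],
    'columnA': ['{"dist": "600", "time": "0:12.10"}',
                '{"dist": "600"}'],
    'columnB': ['[{"pos": "1st", "value": "500"}, {"pos": "total", "value": "1000"}]',
                '[{"pos": "1st", "value": "500"}, {"pos": "total", "value": "800"}]'],
})

# Decode the JSON strings with json.loads instead of ast.literal_eval
A = pd.json_normalize(df['columnA'].apply(json.loads).tolist()).add_prefix('columnA.')

# Key each list entry by its "pos" field explicitly
B = pd.json_normalize(
    df['columnB'].apply(lambda s: {d['pos']: d['value'] for d in json.loads(s)}).tolist()
).add_prefix('columnB.pos.')

flat = df[['id', 'name']].join([A, B])
```

(pd.json_normalize is the top-level import on pandas >= 1.0; on the older versions current when this answer was written, it lived at pandas.io.json.json_normalize.)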

– Nickil Maveli
    Great thanks for the answer ! one thing though, is the returned lists on the list_of_dicts (list(d.values())[0], list(d.values())[1]), and not the other way round? Otherwise this worked perfect for me. – sfactor Oct 07 '16 at 06:17
    As you would know that `dictionaries` do not preserve the order while performing iteration, the values present in the `dict` were appearing in the order opposite to that of yours and hence was the need to use the slicing notation differently compared to yours. If it's appearing in the same order as you've mentioned, go ahead with it or you can even make use of an [`Ordered Dict`](https://docs.python.org/3/library/collections.html#collections.OrderedDict) to preserve the order if you want to. – Nickil Maveli Oct 07 '16 at 08:22
    Why the (slow!) `ast.literal_eval()` call when you should be using `json.loads()`? The latter handles correct JSON data, the former only *Python* syntax, which *differs materially* when it comes to booleans, nulls and unicode data outside of the BMP. – Martijn Pieters Nov 11 '17 at 21:37
  • @MartijnPieters: Thanks for the comment. I've updated my post. – Nickil Maveli Nov 12 '17 at 08:25
  • Not only is it faster, it'll also avoid `ValueError` exceptions when `true`, `false` or `null` values are involved. JSON is not Python. – Martijn Pieters Nov 13 '17 at 13:17
  • @NickilMaveli how do I use this if I have `nan` value in column A, I am getting an error `malformed node or string: nan` when trying to use the `only_dict` function, but when I remove the 'nan' values from the column A it is fine. Thank you – Durga Gaddam Apr 09 '18 at 21:49
    If your data contains null, you can update the `only_dict` method to: `return ast.literal_eval(d) if pd.notnull(d) else {}` Otherwise, it returns `ValueError: malformed node or string: nan` – E. Zeytinci Nov 18 '19 at 21:03

The quickest seems to be:

import pandas as pd
import json

json_struct = json.loads(df.to_json(orient="records"))
df_flat = pd.json_normalize(json_struct)  # on older pandas, use pd.io.json.json_normalize
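For example, on a frame whose cells already hold Python dicts (a minimal sketch; note that list-of-dict columns like the question's columnB survive the round trip as lists and still need exploding separately):

```python
import json
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'columnA': [{"dist": "600", "time": "0:12.10"}, {"dist": "600"}],
})

# Round-trip through JSON, then let json_normalize flatten the nested dicts
json_struct = json.loads(df.to_json(orient="records"))
df_flat = pd.json_normalize(json_struct)
# df_flat columns: id, columnA.dist, columnA.time
```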
– staonas (edited by LucyDrops)

TL;DR Copy-paste the following function and use it like this: flatten_nested_json_df(df)

This is the most general function I could come up with:

def flatten_nested_json_df(df):
    
    df = df.reset_index()
    
    print(f"original shape: {df.shape}")
    print(f"original columns: {df.columns}")
    
    
    # search for columns to explode/flatten
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()
    
    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()
    
    print(f"lists: {list_columns}, dicts: {dict_columns}")
    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []
        
        for col in dict_columns:
            print(f"flattening: {col}")
            # explode dictionaries horizontally, adding new columns
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns) # inplace
        
        for col in list_columns:
            print(f"exploding: {col}")
            # explode lists vertically, adding new columns
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            # Prevent combinatorial explosion when multiple
            # cols have lists or lists of lists
            df = df.reset_index(drop=True)
            new_columns.append(col)
        
        # check if there are still dict or list fields to flatten
        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()
        
        print(f"lists: {list_columns}, dicts: {dict_columns}")
        
    print(f"final shape: {df.shape}")
    print(f"final columns: {df.columns}")
    return df

It takes a dataframe that may have nested lists and/or dicts in its columns, and recursively explodes/flattens those columns.

It uses pandas' pd.json_normalize to explode the dictionaries (creating new columns), and pandas' explode to explode the lists (creating new rows).
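The two building blocks can be seen in isolation (a minimal sketch with toy data):

```python
import pandas as pd

df = pd.DataFrame({
    'a': [{'x': 1, 'y': 2}, {'x': 3}],   # dict column
    'b': [[1, 2], [3]],                  # list column
})

# Dicts flatten horizontally into new columns
wide = pd.json_normalize(df['a'].tolist()).add_prefix('a.')

# Lists explode vertically into new rows
tall = df['b'].explode()
```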

Simple to use:

# Test
df = pd.DataFrame(
    columns=['id','name','columnA','columnB'],
    data=[
        [1,'John',{"dist": "600", "time": "0:12.10"},[{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]],
        [2,'Mike',{"dist": "600"},[{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "total", "value": "800"}]]
    ])

flatten_nested_json_df(df)

It's not the most efficient thing on earth, and it has the side effect of resetting your dataframe's index, but it gets the job done. Feel free to tweak it.

– Michele Piccolini (edited by CarlosK)
    This is BY FAR the best solution I have seen in a long time! Good job! – Serge de Gosson de Varennes May 31 '21 at 15:06
  • Hi there, this is helpful, but it doesn't seem to save the new dataframe – Cameron Stewart Nov 05 '21 at 09:40
  • @CameronStewart save where? – Michele Piccolini Nov 08 '21 at 20:18
  • It gives me error `msg.format(req_len=len(left.columns), given_len=len(right)) ValueError: Unable to coerce to Series, length must be 44: given 1` – Atharv Thakur Jun 17 '22 at 06:52
  • By far the best ever solution on this problem...thumbs up – John Sep 28 '22 at 11:04
  • This can be improved to allow subtypes of `dict` and `list` (e.g. an `OrderedDict`) in the columns search with for example: ```s = df.applymap(lambda e: isinstance(e, dict)).all() ``` – badgerm Oct 18 '22 at 14:01
  • This is a great solution one issue I found is when a key value (for any one record) in a multi-record json file is null. When this happens, the type function doesn't return the correct dict / list type resulting in the column being ignored. Not sure if there is there a way for the type function to account for the nulls in this situation – KahlilG Jan 12 '23 at 03:04

Create a custom function to flatten columnB, then use pd.concat:

def flatten(js):
    return pd.DataFrame(js).set_index('pos').squeeze()

pd.concat([df.drop(['columnA', 'columnB'], axis=1),
           df.columnA.apply(pd.Series),
           df.columnB.apply(flatten)], axis=1)
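Note this assumes the cells already hold Python dicts/lists rather than JSON strings (decode with json.loads first if not). What flatten does to a single cell, as a quick sketch:

```python
import pandas as pd

def flatten(js):
    # list of {"pos": ..., "value": ...} dicts -> Series indexed by pos
    return pd.DataFrame(js).set_index('pos').squeeze()

cell = [{"pos": "1st", "value": "500"}, {"pos": "total", "value": "800"}]
s = flatten(cell)
# s is a Series with index ['1st', 'total'] and the values as data
```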


– piRSquared
  • This was the only solution that worked for me in a situation with a columnB-type column. Look at the elegance of the code! One line. – Diego May 18 '23 at 13:17