-1

I have data in a dataframe that resembles below. I am trying to flatten the data so the values in the JSON blobs convert to columns and if there are multiple records in the JSON blob it creates a new row. I would like to leave non-JSON fields (i.e ID) as is.

I have specifically followed: How to flatten a pandas dataframe with some columns as json? - but after execution I am left unsuccessful with the same dataframe with unparsed JSON. Below is a snippet of the function from that post and that has worked for me before:

def flatten_nested_json_df(df):

    df = df.reset_index()

    print(f"original shape: {df.shape}")
    print(f"original columns: {df.columns}")


    # search for columns to explode/flatten
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()

    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    print(f"lists: {list_columns}, dicts: {dict_columns}")
    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            print(f"flattening: {col}")
            # explode dictionaries horizontally, adding new columns
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns) # inplace

        for col in list_columns:
            print(f"exploding: {col}")
            # explode lists vertically, adding new columns
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        # check if there are still dict o list fields to flatten
        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()

        print(f"lists: {list_columns}, dicts: {dict_columns}")

    print(f"final shape: {df.shape}")
    print(f"final columns: {df.columns}")
    return df

Any help is appreciated!

ID PROPERTIES FORMSUBMISSIONS
123 {"firstname":{"value":"FAKE"},"lastmodifieddate":{"value":"FAKE"},"lastname":{"value":"FAKE"}} [{"contact-associated-by":["FAKE"],"conversion-id":"FAKE","form-id":"FAKE","form-type":"FAKE","meta-data":[],"portal-id": FAKE,"timestamp": FAKE,"title":"FAKE"}]
Emma
  • 8,518
  • 1
  • 18
  • 35
hansolo
  • 903
  • 4
  • 12
  • 28
  • Every single method I am trying I get erros 'str' object has no attribute 'values' or 'list' object has no attribute 'values' - I know I must be missing something simple – hansolo Jan 12 '22 at 16:29
  • The error usually means what you passed to `pd.json_normalize` is a list instead of an object. so do `explode` the column before you pass to `pd.json_normalize`. Also, please include your code that you have currently so that people can help on it. – Emma Jan 12 '22 at 16:41
  • @emma thanks for the reply - just edited my initial post to add the function I have been trying out that has worked for me before - I can get the desired results when I do `pd.read_json(dataframe["FORMSUBMISSIONS"][0])` but I cant figure out how to apply that across the dataframe – hansolo Jan 12 '22 at 16:50

1 Answers1

1

You can use pd.json_normalize which should be more simple.

>>> df
    ID                                         PROPERTIES                                    FORMSUBMISSIONS
0  123  {'firstname': {'value': 'FAKE'}, 'lastmodified...  [{'contact-associated-by': ['FAKE'], 'conversi...

>>> df = df.explode('FORMSUBMISSIONS')  # list to dict
>>> df
    ID                                         PROPERTIES                                    FORMSUBMISSIONS
0  123  {'firstname': {'value': 'FAKE'}, 'lastmodified...  {'contact-associated-by': ['FAKE'], 'conversio...

Now you can do json_normalize on the FORMSUBMISSIONS column. To preserve the other columns, I use pd.concat

>>> df = pd.concat([df, pd.json_normalize(df['FORMSUBMISSIONS']), axis=1).drop('FORMSUBMISSIONS', axis=1)

>>> df
    ID                                         PROPERTIES contact-associated-by conversion-id form-id form-type meta-data portal-id timestamp title
0  123  {'firstname': {'value': 'FAKE'}, 'lastmodified...                [FAKE]          FAKE    FAKE      FAKE        []      FAKE      FAKE  FAKE

You can do the same thing on PROPERTIES column.

df = pd.concat([df, pd.json_normalize(df.PROPERTIES)], axis=1).drop('PROPERTIES', axis=1)
Emma
  • 8,518
  • 1
  • 18
  • 35
  • thank you again for your help here - I executed this: ```df = dataframe df1 = df.explode('FORMSUBMISSIONS') df1 = pd.concat(df1, pd.json_normalize(df1['FORMSUBMISSIONS']), axis=1).drop('FORMSUBMISSIONS', axis=1) df1``` but got the same error `'str' object has no attribute 'values'` – hansolo Jan 12 '22 at 17:22
  • Sounds like you have string in the column. Can you paste the original JSON, if you are reading data from JSON file? if not, could you show me what you get from `df.to_dict()` before `explode`? – Emma Jan 12 '22 at 17:31
  • `'FORMSUBMISSIONS': {0: '[{"contact-associated-by":["EMAIL"],"conversion-id":"FAKEVALUE","form-id":"FAKEVALUE","form-type":"FAKEVALUE","meta-data":[],"portal-id":FAKEVALUE,"timestamp":FAKEVALUE,"title":"FAKEVALUE"}]',` that is the first record after I run df.to_dict() – hansolo Jan 12 '22 at 17:43
  • 1
    Right. you have stringified list for `FORMSUBMISSIONS`. You need to unstringify before the `explode`. `df['FORMSUBMISSIONS'] = [json.loads(x) for x in df.FORMSUBMISSIONS]` `df = df.explode('FORMSUBMISSIONS')` – Emma Jan 12 '22 at 18:21
  • wow emma - I think that worked. The same method is not working for PROPERTIES, but I am guessing I just have to play around with that a little more ..Cant thank you enough – hansolo Jan 12 '22 at 19:02
  • I would guess that you also have stringified object for PROPERTIES. So you need to do `json.loads` before doing `pd.json_normalize`. – Emma Jan 12 '22 at 19:16