1

There's a table where one data point of its column event looks like this:

THE 'event IS A STRING COLUMN!

df['event']
RETURNS:
"{'eventData': {'type': 'page', 'name': "WHAT'S UP"}, 'eventId': '1003', 'deviceType': 'kk', 'pageUrl': '/chick 2/whats sup', 'version': '1.0.0.888-10_7_2020__4_18_30', 'sessionGUID': '1b312346a-cd26-4ce6-888-f25143030e02', 'locationid': 'locakdi-3b0c-49e3-ab64-741f07fd4cb3', 'eventDescription': 'Page Load'}"

I'm trying to extract the nested dictionary eventData from the dictionary and create a new column like below:

df['event'] 
RETURNS: 
{'eventId': '1003', 'deviceType': 'kk', 'pageUrl': '/chick 2/whats sup', 'version': '1.0.0.888-10_7_2020__4_18_30', 'sessionGUID': '1b312346a-cd26-4ce6-888-f25143030e02', 'locationid': 'locakdi-3b0c-49e3-ab64-741f07fd4cb3', 'eventDescription': 'Page Load'}

df['eventData']
RETURNS:
{'type': 'page', 'name': "WHAT'S UP"}

How do I do this?

2 Answers2

0

I would look at using the pandas apply method on the event column.

If the eventData key is expected to be present in the event column dictionary for all rows of the data frame, something below may suffice

import json
import numpy as np

def get_event_data_from_event(event_str):
    """
    Convert event string to dict and return event_data
    """
    try:
        event_as_dict = json.loads(event_str)
    except json.decoder.JSONDecodeError:
        return np.nan
    else
        if not "eventData" in event_as_dict.keys():
            return np.nan
        return event_as_dict["eventData"]  

df["eventData"] = df["event"].apply(lambda x: get_event_data_from_event(x))

Which will return an N/A for that row in the eventData column if the event dictionary is not formatted as you expected it to be.

You could then drop those non-conforming rows with a dropna like so:

df_subset = df.dropna(axis='columns', subset="eventData")
Alexis Lucattini
  • 1,211
  • 9
  • 13
  • Sorry about the misleading. The 'event' column is actually a STRING column. The only way I can think of is to apply REGEX to get the second curly bracket "{ }" from that string and create a new column. Do you have any idea how to construct the regex for this case? – Winston Feng Nov 18 '20 at 03:59
  • I've updated my answer above to expect the event column to be a string. You can use `json.loads` to load a json string as a dict/list object. – Alexis Lucattini Apr 05 '21 at 23:39
0

I've finally fot the answer from another post: Python flatten multilevel/nested JSON

How to use: json_col = pd.DataFrame([flatten_json(x) for x in df['json_column']])

def flatten_json(nested_json, exclude=['']):
    out = {}
    def flatten(x, name='', exclude=exclude):
        if type(x) is dict:
            for a in x:
                if a not in exclude: flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out