6

I have a dataset that looks like the one below. It is relational, but has a dimension called event_params which is a JSON object of data related to the event_name in the respective row.

import pandas as pd

a_df = pd.DataFrame(data={
    'date_time': ['2021-01-03 15:12:42', '2021-01-03 15:12:46', '2021-01-03 15:13:01'
                  , '2021-01-03 15:13:12', '2021-01-03 15:13:13', '2021-01-03 15:13:15'
                  , '2021-01-04 03:29:01', '2021-01-04 18:15:14', '2021-01-04 18:16:01'],
    'user_id': ['dhj13h', 'dhj13h', 'dhj13h', 'dhj13h', 'dhj13h', 'dhj13h', '38nr10', '38nr10', '38nr10'],
    'account_id': ['181d9k', '181d9k', '181d9k', '181d9k', '181d9k', '181d9k', '56sf15', '56sf15', '56sf15'],
    'event_name': ['button_click', 'screen_view', 'close_view', 'button_click', 'exit_app', 'uninstall_app'
                   , 'install_app', 'exit_app', 'uninstall_app'],
    'event_params': ['{\'button_id\': \'shop_screen\', \'button_container\': \'main_screen\', \'button_label_text\': \'Enter Shop\'}',
                     '{\'screen_id\': \'shop_main_page\', \'screen_controller\': \'main_view_controller\', \'screen_title\': \'Main Menu\'}',
                     '{\'screen_id\': \'shop_main_page\'}',
                     '{\'button_id\': \'back_to_main_menu\', \'button_container\': \'shop_screen\', \'button_label_text\': \'Exit Shop\'}',
                     '{}',
                     '{}',
                     '{\'utm_campaign\': \'null\', \'utm_source\': \'null\'}',
                     '{}',
                     '{}']
    })

I am looking for approaches on how to handle this sort of data. My initial approach is with pandas, but I'm open to other methods.

My ideal end state would be to examine the relationships between events with respect to each user. In the current form, I have to compare the dicts/JSON blobs sitting in event_params to determine the context behind an event.

I've tried using explode() to expand out the event_params column. My thinking is that the best approach would be to turn event_params into a relational format, where each parameter becomes an extra row of the dataframe with respect to its preceding values (in other words, while maintaining the date_time, user_id and event_name that it was related to initially).

My explode approach didn't work well:

a_df['event_params'] = a_df['event_params'].apply(eval)
exploded_df = a_df.explode('event_params')

The output of that was:

date_time, user_id, account_id, event_name, event_params
2021-01-03 15:12:42,dhj13h,181d9k,button_click,button_id
2021-01-03 15:12:42,dhj13h,181d9k,button_click,button_container

It has kind of worked, but it stripped the value fields. Ideally I'd like to maintain those value fields as well.

Cold Fish
  • It isn't clear to me what you are trying to do exactly. "I am looking for approaches on how to handle this sort of data" / "My ideal end state would be to examine each relationships with respect to each user." --> I don't understand what your target is. If you could define what you want to do more precisely, you're more likely to get help – slymore Sep 10 '22 at 00:56
  • I have nested key-value parameter data. My only idea to explore that data is to explode it to make it relational. I am looking for other approaches on how I could examine this sort of data structure. My question is exploratory in nature. In my view, I can't define what exactly I'm looking for more than that. – Cold Fish Sep 10 '22 at 01:03

4 Answers

7

I hope I've understood your question right. You can transform the event_params column from a dict to a list of dicts, explode it, and then expand each dict into new key/value columns:

from ast import literal_eval


a_df = a_df.assign(
    event_params=a_df["event_params"].apply(
        lambda x: [{"key": k, "value": v} for k, v in literal_eval(x).items()]
    )
).explode("event_params")

a_df = pd.concat(
    [a_df, a_df.pop("event_params").apply(pd.Series)],
    axis=1,
).drop(columns=0)

print(a_df)

Prints:

             date_time user_id account_id     event_name                key                 value
0  2021-01-03 15:12:42  dhj13h     181d9k   button_click          button_id           shop_screen
0  2021-01-03 15:12:42  dhj13h     181d9k   button_click   button_container           main_screen
0  2021-01-03 15:12:42  dhj13h     181d9k   button_click  button_label_text            Enter Shop
1  2021-01-03 15:12:46  dhj13h     181d9k    screen_view          screen_id        shop_main_page
1  2021-01-03 15:12:46  dhj13h     181d9k    screen_view  screen_controller  main_view_controller
1  2021-01-03 15:12:46  dhj13h     181d9k    screen_view       screen_title             Main Menu
2  2021-01-03 15:13:01  dhj13h     181d9k     close_view          screen_id        shop_main_page
3  2021-01-03 15:13:12  dhj13h     181d9k   button_click          button_id     back_to_main_menu
3  2021-01-03 15:13:12  dhj13h     181d9k   button_click   button_container           shop_screen
3  2021-01-03 15:13:12  dhj13h     181d9k   button_click  button_label_text             Exit Shop
4  2021-01-03 15:13:13  dhj13h     181d9k       exit_app                NaN                   NaN
5  2021-01-03 15:13:15  dhj13h     181d9k  uninstall_app                NaN                   NaN
6  2021-01-04 03:29:01  38nr10     56sf15    install_app       utm_campaign                  null
6  2021-01-04 03:29:01  38nr10     56sf15    install_app         utm_source                  null
7  2021-01-04 18:15:14  38nr10     56sf15       exit_app                NaN                   NaN
8  2021-01-04 18:16:01  38nr10     56sf15  uninstall_app                NaN                   NaN
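A side note on this long format: once every parameter is its own key/value row, per-event analysis becomes a plain groupby. A minimal sketch (using a small subset of the output above for illustration) counting how many parameters each event_name carries:

```python
import pandas as pd

# Long-format rows as produced above (subset for illustration)
long_df = pd.DataFrame({
    "event_name": ["button_click", "button_click", "screen_view", "exit_app"],
    "key": ["button_id", "button_container", "screen_id", None],
    "value": ["shop_screen", "main_screen", "shop_main_page", None],
})

# Number of parameters attached to each event (NaN keys are not counted)
params_per_event = long_df.groupby("event_name")["key"].count()
print(params_per_event)
```

Events with empty event_params (like exit_app) come out as 0 because count() skips NaN.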
Andrej Kesely
  • Is there a reason for `literal_eval` from `ast` over the standard `eval` call? – Cold Fish Sep 07 '22 at 22:30
  • 2
    @ColdFish `eval` is not considered safe. especially running on output from the outside (internet). See [this](https://stackoverflow.com/questions/15197673/using-pythons-eval-vs-ast-literal-eval) for example – Andrej Kesely Sep 07 '22 at 22:33
3

A version based on Andrej Kesely's great solution with speed improvements. The differences are:

  • I compared the computational time for a bigger input df (pd.concat([a_df]*1000)) and found this solution to be much faster than Andrej Kesely's solution, ~50 ms vs. 6 s. See Speed testing below.
  • Instead of eval you can also parse the JSON column with the json package.
  • For readability and debugging, I prefer not to pack everything into one line and to avoid lambdas, but that's a personal preference of course.

Define the unpacker

import json

def read_json(string):
    return [{'event_key': i, 'event_value': j} for i, j in json.loads(string).items()]
    
def explode_json(df, json_column):
    df = df.copy()  # needed to keep df untouched
    
    # parse the JSON strings; json.loads accepts double quotes only
    df[json_column] = df[json_column].str.replace("'", '"')
    df[json_column] = df[json_column].apply(lambda x: json.loads(x).items())
    
    # explode df, with ignore_index to get unique indexes
    df = df.explode(json_column, ignore_index=True)
    
    df_event = pd.DataFrame(data=df[json_column].to_list(),
                            columns=('event_key', 'event_value'))
    return pd.concat([df.drop(json_column, axis=1), df_event],
                     axis=1, ignore_index=False)

Apply it

a_df = explode_json(a_df, "event_params")

With output (similar to Andrej Kesely):

             date_time user_id account_id     event_name          event_key           event_value 
-------------------------------------------------------------------------------------------------- 
0  2021-01-03 15:12:42  dhj13h     181d9k   button_click          button_id           shop_screen   
1  2021-01-03 15:12:42  dhj13h     181d9k   button_click   button_container           main_screen   
2  2021-01-03 15:12:42  dhj13h     181d9k   button_click  button_label_text            Enter Shop     
...  

Speed testing

Define the function explode_json_ak, which wraps Andrej Kesely's version. df.copy() is needed here to keep the original dataframe unchanged across the runs of timeit.

def explode_json_ak(df, json_column, inplace=False):
    a_df = df.copy()  # needed to keep df untouched
    a_df = a_df.assign(
        event_params=a_df[json_column].apply(
            lambda x: [{"key": k, "value": v} for k, v in literal_eval(x).items()]
        )
    ).explode(json_column)

    return pd.concat(
        [a_df, a_df.pop(json_column).apply(pd.Series)],
        axis=1,
    ).drop(columns=0)

And get both execution times.

df_test = pd.concat([a_df]*1000) # get a bit more data

%timeit -n 2 -r 1 explode_json_ak(df_test, "event_params")  # Andrej Kesely
# 6.51 s ± 0 ns per loop (mean ± std. dev. of 1 run, 2 loops each)

%timeit -n 3 -r 3 explode_json(df_test, "event_params")  # This solution
# 42 ms ± 1.88 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)

The more events there are, or the longer the JSON blobs, the bigger the difference between the two versions. The speed gain comes from reading the rows into a DataFrame in one call, avoiding .apply(), and from json.loads being ~4 times faster here than ast.literal_eval.
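The parsing-speed part of that claim can be checked in isolation with timeit. A minimal sketch (absolute numbers will vary by machine; only the ratio is interesting):

```python
import json
import timeit
from ast import literal_eval

# One representative event_params string from the question
s = "{'button_id': 'shop_screen', 'button_container': 'main_screen', 'button_label_text': 'Enter Shop'}"

# Time both parsers on the same string; json.loads needs double quotes
t_ast = timeit.timeit(lambda: literal_eval(s), number=10_000)
t_json = timeit.timeit(lambda: json.loads(s.replace("'", '"')), number=10_000)
print(f"literal_eval: {t_ast:.4f}s  json.loads(+replace): {t_json:.4f}s")
```

Both calls produce the same dict for this input; the difference is purely parsing overhead.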

Different views for the DataFrame

With that exploded format, pandas allows several ways to view it.

a_df.pivot(index=['user_id', 'date_time'], columns='event_key')
  • or split things into groups
gb = a_df.groupby(['user_id', 'date_time'])
print(list(gb.groups))  # list of groups: [('38nr10', '2021-01-04 03:29:01'), ...]
gb.get_group(('38nr10', '2021-01-04 03:29:01'))
  • and more...
Andrew
  • 1
    OP's data has single quotes, not double quotes; that'd be against the JSON spec (but not Python's dict literals, which is why `ast.literal_eval` works.) – AKX Sep 13 '22 at 15:12
  • (And blindly replacing all single quotes with double quotes will also do so within actual data such as strings.) – AKX Sep 13 '22 at 15:13
  • Its also a question of speed. `json.loads` is ~4 times faster than `ast.literal_eval` for the example below. (5.81 µs vs. 21.3 µs). `a = '{\'button_id\': \'shop_screen\', \'button_container\': \'main_screen\', \'button_label_text\': \'Enter Shop\'}'` - `%timeit -n 3 -r 3 literal_eval(a)` - `%timeit -n 3 -r 3 json.loads(a.replace("'", '"'))` – Andrew Sep 15 '22 at 12:12
2

I did some research on EDA for nonrelational data, and didn't find recommendations that didn't involve wrangling it into a table. (Of course it depends on the type of analysis you want to do - sometimes just visualising nested JSON well can be enough to come up with some ideas.)

Andrej Kesely's answer is great, but I thought it might be helpful to change the data into "wide" format instead of "long", since the keys here can be seen as features of the user action.

from ast import literal_eval

a_df = pd.DataFrame(data)

tmp1 = a_df["event_params"].apply(lambda x: literal_eval(x).items()).explode()

tmp2 = pd.DataFrame(tmp1.to_list())
tmp2.index = tmp1.index

a_df = (a_df
        .merge(tmp2
               .dropna(subset=0)
               .pivot(columns=0, values=1), 
               how='left', left_index=True, right_index=True)
        .drop(columns='event_params'))

This gives you a dataframe in which each key is a feature. Then if you want to look at a specific user, you can ignore the irrelevant columns like this:

(a_df
.groupby(['user_id', 'account_id'])
.get_group(('38nr10', '56sf15'))
.dropna(axis=1, how='all'))

Output:

    date_time              user_id  account_id      event_name  utm_campaign    utm_source
6   2021-01-04 03:29:01     38nr10      56sf15      install_app     null    null
7   2021-01-04 18:15:14     38nr10      56sf15      exit_app        NaN     NaN
8   2021-01-04 18:16:01     38nr10      56sf15      uninstall_app   NaN     NaN
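One convenience of this wide format: filtering on a parameter becomes a plain column comparison instead of dict inspection. A minimal sketch (using a hypothetical subset of the merged frame above):

```python
import pandas as pd

# Wide-format rows as produced above (subset for illustration)
wide_df = pd.DataFrame({
    "event_name": ["button_click", "screen_view", "button_click"],
    "button_id": ["shop_screen", None, "back_to_main_menu"],
    "screen_id": [None, "shop_main_page", None],
})

# All clicks on a specific button -- no JSON parsing needed at query time
shop_clicks = wide_df[wide_df["button_id"] == "shop_screen"]
print(shop_clicks)
```

The trade-off is one column per distinct key across the whole dataset, which can get sparse if events share few parameters.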
Josh Friedlander
1

Alternative approach without explode()

pd.json_normalize and pd.melt can also be used to transform event_params into a relational format:

from ast import literal_eval

df = a_df.merge(
    pd.melt(
        pd.json_normalize(a_df['event_params'].map(lambda x: literal_eval(x))),
        ignore_index=False),
    how='outer',
    left_index=True,
    right_index=True) \
    .drop(columns='event_params')

print(df)

output:

              date_time user_id account_id     event_name           variable  \
0   2021-01-03 15:12:42  dhj13h     181d9k   button_click          button_id   
0   2021-01-03 15:12:42  dhj13h     181d9k   button_click   button_container   
0   2021-01-03 15:12:42  dhj13h     181d9k   button_click  button_label_text   
0   2021-01-03 15:12:42  dhj13h     181d9k   button_click          screen_id   
0   2021-01-03 15:12:42  dhj13h     181d9k   button_click  screen_controller   
..                  ...     ...        ...            ...                ...   
8   2021-01-04 18:16:01  38nr10     56sf15  uninstall_app          screen_id   
8   2021-01-04 18:16:01  38nr10     56sf15  uninstall_app  screen_controller   
8   2021-01-04 18:16:01  38nr10     56sf15  uninstall_app       screen_title   
8   2021-01-04 18:16:01  38nr10     56sf15  uninstall_app       utm_campaign   
8   2021-01-04 18:16:01  38nr10     56sf15  uninstall_app         utm_source   

          value  
0   shop_screen  
0   main_screen  
0    Enter Shop  
0           NaN  
0           NaN  
..          ...  
8           NaN  
8           NaN  
8           NaN  
8           NaN  
8           NaN  

[72 rows x 6 columns]

This assigns NaN to all parameters not related to a particular event, which can be useful during analysis.
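If the NaN cross-product rows get in the way, a dropna on the melted value column trims the frame back to only the parameters that actually belong to each event. A minimal sketch (using a small subset with the variable/value column names produced above):

```python
import pandas as pd

# Melted outer-merge rows as produced above (subset for illustration)
df = pd.DataFrame({
    "event_name": ["button_click", "button_click", "uninstall_app"],
    "variable": ["button_id", "screen_id", "utm_source"],
    "value": ["shop_screen", None, None],
})

# Keep only key/value pairs that were actually present on the event
trimmed = df.dropna(subset=["value"])
print(trimmed)
```

Note this also drops events whose event_params were empty ({}), so keep the full frame around if those rows matter.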