I have a dataset that looks like the one below. It is relational, but has a dimension called event_params
which is a JSON object of data related to the event_name
in the respective row.
import pandas as pd
a_df = pd.DataFrame(data={
'date_time': ['2021-01-03 15:12:42', '2021-01-03 15:12:46', '2021-01-03 15:13:01'
, '2021-01-03 15:13:12', '2021-01-03 15:13:13', '2021-01-03 15:13:15'
, '2021-01-04 03:29:01', '2021-01-04 18:15:14', '2021-01-04 18:16:01'],
'user_id': ['dhj13h', 'dhj13h', 'dhj13h', 'dhj13h', 'dhj13h', 'dhj13h', '38nr10', '38nr10', '38nr10'],
'account_id': ['181d9k', '181d9k', '181d9k', '181d9k', '181d9k', '181d9k', '56sf15', '56sf15', '56sf15'],
'event_name': ['button_click', 'screen_view', 'close_view', 'button_click', 'exit_app', 'uninstall_app'
, 'install_app', 'exit_app', 'uninstall_app'],
'event_params': ['{\'button_id\': \'shop_screen\', \'button_container\': \'main_screen\', \'button_label_text\': \'Enter Shop\'}',
'{\'screen_id\': \'shop_main_page\', \'screen_controller\': \'main_view_controller\', \'screen_title\': \'Main Menu\'}',
'{\'screen_id\': \'shop_main_page\'}',
'{\'button_id\': \'back_to_main_menu\', \'button_container\': \'shop_screen\', \'button_label_text\': \'Exit Shop\'}',
'{}',
'{}',
'{\'utm_campaign\': \'null\', \'utm_source\': \'null\'}',
'{}',
'{}']
})
I am looking for approaches on how to handle this sort of data. My initial approach is with pandas, but I'm open to other methods.
My ideal end state would be to examine each relationships with respect to each user. In the current form, I have to compare the dicts/JSON blobs sitting in event_params
to determine the context behind an event.
I've tried using explode()
to expand out the event_params
column. My thinking is the best sort of approach would be to turn event_params
into a relational format, where each parameter is an extra row of the dataframe with respect to it's preceding values (in other words, while maintaining the date_time
, user_id
and event_name
that it was related too initially).
My explode approach didn't work well,
a_df['event_params'] = a_df['event_params'].apply(eval)
exploded_df = a_df.explode('event_params')
The output of that was:
date_time, user_id, account_id, event_name, event_params
2021-01-03 15:12:42,dhj13h,181d9k,button_click,button_id
2021-01-03 15:12:42,dhj13h,181d9k,button_click,button_container
It has kind of worked, but it stripped the value fields. Ideally I'd like to maintain those value fields as well.