1

I have been struggling with this logic. I am getting data from google analytics in this format, per user per site. (So these are all the activities that one user performed on the site) I cannot change the format in which I receive the data.

PROBLEM: I am running a loop through all of the users and get this output for each user. I want to put this data into a data frame in order to use it later. The problem I have, is the 'activities':[{.....},{......}] part, I cannot figure out to store all of this data in a way that makes sense.

{'sampleRate': 1,
 'sessions': [{'activities': [{'activityTime': '2020-01-08T16:00:44.399101Z',
                               'activityType': 'PAGEVIEW',
                               'campaign': '(not set)',
                               'channelGrouping': 'Direct',
                               'customDimension': [{'index': 1}],
                               'hostname': 'company.domain.com',
                               'keyword': '(not set)',
                               'landingPagePath': '/login',
                               'medium': '(none)',
                               'pageview': {'pagePath': '/thepath',
                                            'pageTitle': 'thecurrentwebpage'},
                               'source': '(direct)'},
                              {'activityTime': '2020-01-08T15:58:43.077293Z',
                               'activityType': 'PAGEVIEW',
                               'campaign': '(not set)',
                               'channelGrouping': 'Direct',
                               'customDimension': [{'index': 1}],
                               'hostname': 'company.domain.com',
                               'keyword': '(not set)',
                               'landingPagePath': '/login',
                               'medium': '(none)',
                               'pageview': {'pagePath': '/theotherpath',
                                            'pageTitle': 'thecurrentwebpage'},
                               'source': '(direct)'}],
               'dataSource': 'web',
               'deviceCategory': 'desktop',
               'platform': 'Windows',
               'sessionDate': '2020-01-08',
               'sessionId': '1578491x03d'},
              {'activities': [{'activityTime': '2019-12-28T21:58:48.993944Z',
                               'activityType': 'PAGEVIEW',
                               'campaign': '(not set)',.....

EXPECTED OUTPUT:

For each user's data to be stored in tables organised as such: Lucid chart brainstorming of ERD layouts

If there is some logic error in the picture, I am happy to change what I have. I just need the data to work.

PS: I need to use SQL and ERD's in LucidChart, I have never before had to manipulate data in this format. Any help, to get the data -that is structured like the example above- into a dataframe(s).

EDITED:

Example of two different types of activities (the activity is always classified as either 'pageview' or 'event'):

{'activityTime':
                               # Pageview activity
                               '2020-01-08T15:48:38.012671Z',
                               'activityType': 'PAGEVIEW',
                               'campaign': '(not set)',
                               'channelGrouping': 'Direct',
                               'customDimension': [{'index': 1}],
                               'hostname': 'company.domain.com',
                               'keyword': '(not set)',
                               'landingPagePath': '/login',
                               'medium': '(none)',
                               'pageview': {'pagePath': '/login',
                                            'pageTitle': 'titleofthepage'},
                               'source': '(direct)'},

                              # Event activity
                              {'activityTime': '2020-01-08T15:48:37.915105Z',
                               'activityType': 'EVENT',
                               'campaign': '(not set)',
                               'channelGrouping': 'Direct',
                               'customDimension': [{'index': 1}],
                               'event': {'eventAction': 'Successfully Logged '
                                                        'In',
                                         'eventCategory': 'Auth',
                                         'eventCount': '1',
                                         'eventLabel': '(not set)'},
                               'hostname': 'company.domain.com',
                               'keyword': '(not set)',
                               'landingPagePath': '/login',
                               'medium': '(none)',
                               'source': '(direct)'}]
Llewellyn Hattingh
  • 306
  • 1
  • 5
  • 16
  • Why do you split Table Activity, Event and Pageview? They can be all in the same table or is there a reason for this? In addition how do you get eventCategory, eventCount, eventLabel and eventAction? – Boendal Jan 23 '20 at 13:08
  • Thank you, so the output can either be pageview or event (to my knowledge, pageview: when the user switched to a different page while event: something that was done on the same page). There are lots of activities in one [{'activity': ...}] I will send you an example of 'pageview'-activity and an example of a 'event'-activity... In other words, there are lots of activities and a activity can either have event ouputs or pageview outputs, it will never be both. – Llewellyn Hattingh Jan 23 '20 at 13:14
  • If all three of the tables can be in one table, we can certainly do it that way, whichever way works best – Llewellyn Hattingh Jan 23 '20 at 13:19
  • It makes sense now. There are as far as I know advantage to store it in your table and disadvantages. Advantages: No joins, Disadvantages: Redudance (in your case not really), NULL values, bigger table (depending on your size doesn't fit all into memory). (I'm not an expert in databases but that is what I know) but with the information you shared you can do both, up to you – Boendal Jan 23 '20 at 13:32
  • Nice, thank you. How do I actually go about putting this data into a dataframe – Llewellyn Hattingh Jan 23 '20 at 13:39
  • I can show it to you maybe later. Have some private issues and no time right now. (if there is no answer later when I have time). But basically you parse the json, iterate through and create a dictionary. With a dictionary you can easy create a panda dataframe. – Boendal Jan 23 '20 at 13:47
  • Is there a way SQL can make this simpler? I am obligated to use SQL – Llewellyn Hattingh Jan 23 '20 at 13:54
  • you can also use SQL in Pandas (I don't use it a lot) but that is independent to creating the table/dataframe. Either way you have to create table for a SQL database or dataframe for pandas. And for that you need data processing. – Boendal Jan 23 '20 at 13:57

1 Answers1

1

For example you can do it like this:

import pandas as pd
import json
str = """{"sampleRate": 1,
 "sessions": [{"activities": [{"activityTime": "2020-01-08T16:00:44.399101Z",
                               "activityType": "PAGEVIEW",
                               "campaign": "(not set)",
                               "channelGrouping": "Direct",
                               "customDimension": [{"index": 1}],
                               "hostname": "company.domain.com",
                               "keyword": "(not set)",
                               "landingPagePath": "/login",
                               "medium": "(none)",
                               "pageview": {"pagePath": "/thepath",
                                            "pageTitle": "thecurrentwebpage"},
                               "source": "(direct)"},
                              {"activityTime": "2020-01-08T15:48:37.915105Z",
                               "activityType": "EVENT",
                               "campaign": "(not set)",
                               "channelGrouping": "Direct",
                               "customDimension": [{"index": 1}],
                               "event": {"eventAction": "Successfully Logged In",
                                         "eventCategory": "Auth",
                                         "eventCount": "1",
                                         "eventLabel": "(not set)"},
                               "hostname": "company.domain.com",
                               "keyword": "(not set)",
                               "landingPagePath": "/login",
                               "medium": "(none)",
                               "source": "(direct)"}],
               "dataSource": "web",
               "deviceCategory": "desktop",
               "platform": "Windows",
               "sessionDate": "2020-01-08",
               "sessionId": "1578491x03d"}]}"""


data = json.loads(str)

session_keys = "sessionId,dataSource,deviceCategory,platform,sessionDate,DB_id".split(",")
event_keys = "activityTime,eventCategory,eventCount,eventLabel,eventAction".split(",")
pageview_keys = "activityTime,pageTitle,pagePath".split(",")

sessions = {k:[] for k in session_keys}
events = {k:[] for k in event_keys}
pageviews = {k:[] for k in pageview_keys}
activities = {"sessionId":[],"activityTime":[]}

for session in data["sessions"]:
    for easy_key in session_keys[:5]:
        sessions[easy_key] += [session[easy_key]]
    for activity in session["activities"]:
        activity_time = activity["activityTime"]
        activities["sessionId"] += [session["sessionId"]]
        activities["activityTime"] += [activity_time]
        if activity["activityType"] == "PAGEVIEW":
            pageviews["activityTime"] += [activity_time]
            pageviews["pageTitle"] += [activity["pageview"]["pageTitle"]]
            pageviews["pagePath"] += [activity["pageview"]["pagePath"]]
        elif activity["activityType"] == "EVENT":
            events["activityTime"] += [activity_time]
            events["eventAction"] += [activity["event"]["eventAction"]]
            events["eventCategory"] += [activity["event"]["eventCategory"]]
            events["eventCount"] += [activity["event"]["eventCount"]]
            events["eventLabel"] += [activity["event"]["eventLabel"]]
        else:
            print("Unknown Activity: {}".format(activity["activityType"]))

    sessions["DB_id"] += [0]

df_session = pd.DataFrame.from_dict(sessions)
df_session.set_index('sessionId', inplace=True)
df_event = pd.DataFrame.from_dict(events)
df_event.set_index('activityTime', inplace=True)
df_pageview = pd.DataFrame.from_dict(pageviews)
df_pageview.set_index('activityTime', inplace=True)
df_activities = pd.DataFrame.from_dict(activities)

Output each DF:

#df_session:

            dataSource deviceCategory platform sessionDate  DB_id
sessionId                                                        
1578491x03d        web        desktop  Windows  2020-01-08      0



#df_activities:
     sessionId                 activityTime
0  1578491x03d  2020-01-08T16:00:44.399101Z
1  1578491x03d  2020-01-08T15:48:37.915105Z



#df_event:
                            eventCategory eventCount eventLabel             eventAction
activityTime                                                                           
2020-01-08T15:48:37.915105Z          Auth          1  (not set)  Successfully Logged In



#df_pageview:
                                     pageTitle  pagePath
activityTime                                            
2020-01-08T16:00:44.399101Z  thecurrentwebpage  /thepath

Output example join

#As example for a join, I only want the event data
df_sa = df_activities.join(df_session, on="sessionId").join(df_event,on="activityTime",how="right") 

print(df_sa)
     sessionId                 activityTime dataSource deviceCategory platform sessionDate  DB_id eventCategory eventCount eventLabel             eventAction
1  1578491x03d  2020-01-08T15:48:37.915105Z        web        desktop  Windows  2020-01-08      0          Auth          1  (not set)  Successfully Logged In

Schema

It is the same as you specified above with 2 changes:

  1. The Table session doesn't have a column activities anymore.

  2. The Table Activity has an additional column sessionId.

SQL

How to execute SQL in pandas Dataframe you can look up online probably to much to cover here. See here for example: Executing an SQL query over a pandas dataset

How to get the data

Some examples: (but in the end you have to figure it out yourself if you want something specific, I don't make an SQL course here)

  • If you only want session data: Query df_session
  • If you want all activities: Query df_event and df_pageview
  • If you want all activities and combined with sessions: join df_session with df_activities then join with df_event and df_pageview

I don't want a Dataframe... I need MYSQL Database (or something else)

Nothing easier than that. The dataframe are in a "properly" database format.

Example for Session:

for index, row in df_sessions.iterrows():
    # for event and pageview the index would be activityTime
    # the df activities don't have a specific index
    sessionId = index 
    dataSource = row['dataSource']
    deviceCategory = row['deviceCategory']
    platform = row['platform']
    sessionDate = row['sessionDate']
    DB_id = row['DB_id']
    # function to save a row in a SQL DB basically:
    # INSERT INTO session (sessionId,dataSource,deviceCategory,platform,sessionDate,DB_id) VALUES(x,x,x,x,x,x)
    save_to_sql(sessionId,dataSource,deviceCategory,platform,sessionDate,DB_id)

the save_to_sql is your own implementation depending on what database you are using. And it would not fit this question to explain that to you.

Comments

  1. DB_id don't know the origin of this value. I set it to 0.
Community
  • 1
  • 1
Boendal
  • 2,496
  • 1
  • 23
  • 36