
I am flattening a data frame in which one column contains a list of dictionaries. I have written code for it, but it takes around 25 seconds to process only 5,000 rows, which is far too slow.

Here is the sample dataset:

event_date  timestamp   event_name      user_properties
20191117    1.57401E+15 user_engagement [{'key': 'ga_session_id', 'value': {'string_value': None, 'int_value': 1574005142, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'ga_session_number', 'value': {'string_value': None, 'int_value': 5, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'first_open_time', 'value': {'string_value': None, 'int_value': 1573974000000, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1573971590380000}}]
20191117    1.57401E+15 screen_view     [{'key': 'ga_session_id', 'value': {'string_value': None, 'int_value': 1574005142, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'ga_session_number', 'value': {'string_value': None, 'int_value': 5, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'first_open_time', 'value': {'string_value': None, 'int_value': 1573974000000, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1573971590380000}}]
20191117    1.57401E+15 user_engagement [{'key': 'ga_session_id', 'value': {'string_value': None, 'int_value': 1574005142, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'ga_session_number', 'value': {'string_value': None, 'int_value': 5, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'first_open_time', 'value': {'string_value': None, 'int_value': 1573974000000, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1573971590380000}}]
20191117    1.57401E+15 user_engagement [{'key': 'ga_session_id', 'value': {'string_value': None, 'int_value': 1574005142, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'ga_session_number', 'value': {'string_value': None, 'int_value': 5, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'first_open_time', 'value': {'string_value': None, 'int_value': 1573974000000, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1573971590380000}}]
20191117    1.57401E+15 user_engagement [{'key': 'ga_session_id', 'value': {'string_value': None, 'int_value': 1574005142, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'ga_session_number', 'value': {'string_value': None, 'int_value': 5, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1574005142713000}}, {'key': 'first_open_time', 'value': {'string_value': None, 'int_value': 1573974000000, 'float_value': None, 'double_value': None, 'set_timestamp_micros': 1573971590380000}}]

Here is the parsed dataframe:

[image: flattened dataframe]

The result uses each 'key' as a column name; in addition, whenever the value dictionary contains a 'set_timestamp_micros' key, a second column named {key}.set_timestamp_micros is created.
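
For example, flattening the first sample row above should yield these columns and values:

ga_session_id = 1574005142
ga_session_id.set_timestamp_micros = 1574005142713000
ga_session_number = 5
ga_session_number.set_timestamp_micros = 1574005142713000
first_open_time = 1573974000000
first_open_time.set_timestamp_micros = 1573971590380000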

Here is the code to flatten the dataframe:

def normalize_complex_column_v2(df, df_copy, column):
    for index, row in df.iterrows():
        for element in row[column]:
            # One column per 'key', plus '{key}.set_timestamp_micros' for timestamp keys
            cols = [element['key']]
            cols += ["%s.%s" % (element['key'], key) for key in element['value'].keys() if 'timestamp' in key]
            # Grow the copy's columns as new keys appear (expensive: runs once per element)
            df_copy = df_copy.reindex(columns=list(dict.fromkeys(df_copy.columns.tolist() + cols)))
            df_copy.loc[index, cols] = [value for key, value in element['value'].items() if value is not None]
    df_copy.drop([column], axis=1, inplace=True)
    return df_copy

How do I optimize this code?

UPDATE: Is there any way I can use swifter to optimize my function?
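
For context, swifter only speeds up apply calls, so the row loop would first have to be rewritten as a function over a single cell; for an object column like this it will most likely fall back to a plain apply rather than parallelize. A minimal sketch (flatten_properties is a hypothetical helper, not from the original code):

import pandas as pd
import swifter  # importing swifter registers the .swifter accessor on Series

def flatten_properties(props):
    # Flatten one list of {'key': ..., 'value': {...}} dicts into a flat dict
    out = {}
    for element in props:
        k, v = element['key'], element['value']
        out[k + '.set_timestamp_micros'] = v['set_timestamp_micros']
        # First non-None value that is not a timestamp
        out[k] = next(val for key, val in v.items()
                      if val is not None and 'timestamp' not in key)
    return out

flat = df['user_properties'].swifter.apply(flatten_properties)
df = df.drop(columns=['user_properties']).join(pd.DataFrame(flat.tolist(), index=df.index))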

Issue with Numba:

<ipython-input-101-15265d3af7fb>:1: NumbaWarning: 
Compilation is falling back to object mode WITH looplifting enabled because Function "flatten_dataframe_column" failed type inference due to: Untyped global name 'defaultdict': cannot determine Numba type of <class 'type'>

File "<ipython-input-101-15265d3af7fb>", line 4:
def flatten_dataframe_column(df,column,fetch_timestamp=True):
    <source elided>
    temp_dict = df[column].to_dict()
    new_dict = defaultdict(dict)

LoweringError: Failed in object mode pipeline (step: object mode backend)
$22.3.182

File "<ipython-input-101-15265d3af7fb>", line 16:
def flatten_dataframe_column(df,column,fetch_timestamp=True):
    <source elided>
                        elements['key'] : [value for key,value in elements['value'].items() \
                                                    if (value is not None and 'timestamp' not in key)][0]
                                                    ^

[1] During: lowering "$22.3.182 = unary(fn=<built-in function not_>, value=$22.3.182)" at <ipython-input-101-15265d3af7fb> (16)

-------------------------------------------------------------------------------
This should not have happened, a problem has occurred in Numba's internals.
You are currently using Numba version 0.47.0.

Please report the error message and traceback, along with a minimal reproducer
at: https://github.com/numba/numba/issues/new

If more help is needed please feel free to speak to the Numba core developers
directly at: https://gitter.im/numba/numba

Thanks in advance for your help in improving Numba!

)
user1584253
  • Maybe more of a workaround than a solution, but have you tried http://numba.pydata.org/ ? – ChatterOne Jan 16 '20 at 12:21
  • I tried Numba, but I am getting a type error; maybe I have to mold the objects into types that are compatible with Numba – user1584253 Jan 17 '20 at 11:36
  • There shouldn't really be a compatibility issue; maybe you can add what you tried to your question? – ChatterOne Jan 17 '20 at 11:51
  • If you look here ( https://numba.pydata.org/numba-doc/dev/reference/pysupported.html ) you'll see that it mentions `Unsupported constructs: ... set, dict and generator comprehensions`. If you want to give it a go, try converting that list comprehension to an explicit loop (see the sketch below), though I'm not really confident that's the problem – ChatterOne Jan 17 '20 at 12:44
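
A minimal sketch of that suggestion (hypothetical and untested under Numba): the comprehension from the traceback rewritten as an explicit loop:

# Explicit-loop equivalent of the failing comprehension
first_value = None
for key, value in elements['value'].items():
    if value is not None and 'timestamp' not in key:
        first_value = value
        break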

2 Answers


For starters, you can iterate through the values in your column instead of the entire dataframe (iterrows builds a full Series object for every row) and free up some of the RAM. Second, your list comprehensions still loop element by element. Third, copying and dropping your dataframe is computationally inefficient.

def normalize_complex_column_v2(df, df_copy, column):
    for i, element in enumerate(df[column].values):
        # Get the first dictionary in the list (skip rows without one)
        element = element[0] if isinstance(element, list) else None
        if element is None:
            continue
        # Loop only over the timestamp keys
        cols = [key for key in element['value'].keys() if 'timestamp' in key]

        # Get values for your columns (.loc is label-based and creates missing columns):
        row = df.index[i]
        for key in cols:
            df.loc[row, key] = element['value'][key]

        # Now create the column for `element['key']`
        df.loc[row, element['key']] = 'foo'  # Some value for this, not sure where you're pulling from...

    return df

This should do the trick. Let me know how it compares!

Yaakov Bressler
  • 'element' is the list: AttributeError: 'list' object has no attribute 'get' – user1584253 Jan 16 '20 at 10:07
  • I am not copying and dropping the entire dataframe in my function. I am only dropping a single column outside the loop, so this is not the overhead. – user1584253 Jan 16 '20 at 11:19
  • Why are you assigning only the first element of the list? – user1584253 Jan 17 '20 at 05:38
  • Also, if you look at my code, when there is a timestamp key in a dictionary, the column name is the outer 'key' with the timestamp key appended. – user1584253 Jan 17 '20 at 05:40
  • There is also an error in the inner loop: ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] – user1584253 Jan 17 '20 at 05:46

I converted the dataframe column to a dictionary and processed the data there, then converted the processed dictionary back to a dataframe and joined it with the original dataframe on the index. It took around 8 seconds to process 500K records.

from collections import defaultdict
import pandas as pd

def flatten_dataframe_column(df, column):
    # {row_index: list_of_property_dicts}
    temp_dict = df[column].to_dict()
    new_dict = defaultdict(dict)
    for index, elements_list in temp_dict.items():
        for elements in elements_list:
            # '{key}.set_timestamp_micros' column
            new_dict[index][elements['key'] + '.set_timestamp_micros'] = \
                elements['value']['set_timestamp_micros']
            # '{key}' column: the first non-None value that is not a timestamp
            new_dict[index][elements['key']] = [value for key, value in elements['value'].items()
                                                if value is not None and 'timestamp' not in key][0]
    return pd.DataFrame.from_dict(new_dict, orient='index')

If anyone can think of a more optimal solution, please do post.
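
For anyone benchmarking further, here is a minimal sketch of one more variant (same assumptions about the column layout as above): build a plain list of row dicts and construct the DataFrame in a single call instead of going through a defaultdict:

import pandas as pd

def flatten_dataframe_column_v2(df, column):
    # One flat dict per row; the DataFrame is built once at the end
    records = []
    for props in df[column]:
        row = {}
        for element in props:
            key, value = element['key'], element['value']
            row[key + '.set_timestamp_micros'] = value['set_timestamp_micros']
            # First non-None value that is not a timestamp
            row[key] = next(v for k, v in value.items()
                            if v is not None and 'timestamp' not in k)
        records.append(row)
    return pd.DataFrame(records, index=df.index)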

user1584253
  • If you can share the dataset in an easy-to-copy way; the format above does not come out right on my PC – sammywemmy Jan 17 '20 at 13:16
  • How can I share the Excel file here? – user1584253 Jan 18 '20 at 10:13
  • Not the entire Excel sheet. Just read it into pandas, do a print(df.head()), and copy and paste the first five rows here so we can work with that. – sammywemmy Jan 18 '20 at 10:42
  • @sammywemmy I tried df.head(), but it trims the value in the user_properties column since it is too long. You can simply copy the data into an Excel file and try parsing it. – user1584253 Jan 20 '20 at 05:07