I have a dataframe with a column containing nested JSON data:
_p_final_fromPICKLE['label'].iloc[2]
'{"params": [{"X": 1500}, {"Y": 0.087}, {"Z": 0}, {"T": 0}]}'
I would like to populate new columns which directly contain the data (X, Y, Z and T in this case).
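To make the goal concrete, here is a tiny made-up frame with the same structure and the output I am after (values repeated purely for illustration):
import pandas as pd

toy = pd.DataFrame({'label': ['{"params": [{"X": 1500}, {"Y": 0.087}, {"Z": 0}, {"T": 0}]}'] * 3})
# Desired result: toy gains four numeric columns X, Y, Z and T,
# holding 1500, 0.087, 0 and 0 respectively for each row.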
The best I could come up with at the moment is:
_p_final_fromPICKLE['Z'] = _p_final_fromPICKLE['label'].apply(lambda x: utils.merge_dicts(*json.loads(x)["params"])['Z'])
where
def merge_dicts(*dict_args):
    """
    Given any number of dicts, shallow copy and merge into a new dict,
    precedence goes to key value pairs in latter dicts.
    """
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result
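On a single row this does what I want, e.g. for the row shown above (calling merge_dicts directly rather than through utils):
merged = merge_dicts(*json.loads(_p_final_fromPICKLE['label'].iloc[2])["params"])
# merged == {'X': 1500, 'Y': 0.087, 'Z': 0, 'T': 0}
merged['Z']  # 0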
Creating the column Z takes 90 seconds to process 13M rows. In my opinion this is too slow, because I can do simple operations between columns in a fraction of a second. I think the function I defined is not the major cause, because just parsing the column as JSON already takes 78 seconds:
_p_final_fromPICKLE['params'] = _p_final_fromPICKLE['label'].apply(lambda x: json.loads(x)["params"])
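For completeness, a one-pass variant that fills all four columns at once would look roughly like this (untested sketch; it still calls json.loads once per row, so I do not expect it to change the picture):
parsed = [merge_dicts(*json.loads(x)["params"]) for x in _p_final_fromPICKLE['label']]
# Build the X/Y/Z/T columns in one go and attach them by index
_p_final_fromPICKLE = _p_final_fromPICKLE.join(pd.DataFrame(parsed, index=_p_final_fromPICKLE.index))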
So I was wondering if I am making this inefficient in some way. I do not see any way around using json.loads, because I have a JSON string in the label column and I need to convert it into a dictionary to use the data. This json.loads call seems to be the slow part; maybe that is due to how I am calling it. I see that people here use assign instead of apply, but I have not managed to find a simple call of that kind for my nested case.
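The closest assign-style version I can write still goes through the same apply, so I do not expect it to help (sketch only):
_p_final_fromPICKLE = _p_final_fromPICKLE.assign(
    Z=lambda df: df['label'].apply(lambda x: merge_dicts(*json.loads(x)["params"])['Z'])
)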
Any suggestion on how to make this run faster?