
I have a dataframe with a column containing nested JSON data:

    _p_final_fromPICKLE['label'].iloc[2]
    '{"params": [{"X": 1500}, {"Y": 0.087}, {"Z": 0}, {"T": 0}]}'

I would like to populate new columns which contain directly the data (X,Y,Z and T in this case).

The best I could come up with at the moment is

     _p_final_fromPICKLE['Z']=_p_final_fromPICKLE['label'].apply(lambda x: utils.merge_dicts(*json.loads(x)["params"])['Z'] )

where

    def merge_dicts(*dict_args):
        """
        Given any number of dicts, shallow copy and merge into a new dict,
        precedence goes to key value pairs in latter dicts.
        """
        result = {}
        for dictionary in dict_args:
            result.update(dictionary)
        return result
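
For illustration, on the sample row shown above this collapses the list of single-key dicts into one flat dict (a quick sketch run outside the dataframe; `sample` is just a throwaway name for the example):

    import json

    sample = '{"params": [{"X": 1500}, {"Y": 0.087}, {"Z": 0}, {"T": 0}]}'
    merged = merge_dicts(*json.loads(sample)["params"])
    # merged == {'X': 1500, 'Y': 0.087, 'Z': 0, 'T': 0}
    merged['Z']  # 0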

Creating the column Z takes 90 seconds to process 13M rows. In my opinion this is too slow, because I can do simple operations between columns in a fraction of a second. I think the function I defined is not the major cause, because just parsing the column as JSON takes 78 seconds:

    _p_final_fromPICKLE['params'] = _p_final_fromPICKLE['label'].apply(lambda x: json.loads(x)["params"])

So I was wondering if I am making this inefficient in some way. I do not see any way around using json.loads, because I have a JSON string in the label and I need to convert it into a dictionary to use the data. This json.loads seems to be the slow part; maybe it is due to how I am calling it. I see that people here use assign instead of apply, but I have not managed to find a simple assign call for my nested case.

Any suggestion on how to make this run faster?

  • IIUC, `pd.json_normalize(_p_final_fromPICKLE['label'], record_path=['params'])`. The reason it takes 90s for 13M rows is the apply, which is a row-level operation as opposed to a set-based one. How much memory is this consuming? Because 13M rows in 90s is quite good for an apply statement imo. – Umar.H Aug 05 '20 at 13:14
  • How do I check the memory usage? Are you asking for the result of `.info()`? – Rho Phi Aug 05 '20 at 13:31
  • The code you wrote assumes 'label' is already parsed JSON, whereas in my case it is a string; that is why I was passing it through `json.loads`. For this reason it gives the error `TypeError: string indices must be integers` – Rho Phi Aug 05 '20 at 13:33
  • What would be the *set* operation to use as an alternative to `apply`? – Rho Phi Aug 05 '20 at 13:34
  • try using `literal_eval`: `from ast import literal_eval`, then `pd.json_normalize(df['j'].map(literal_eval), record_path=['params'])`, where `df['j']` is your json column here (see the sketch after these comments) – Umar.H Aug 05 '20 at 13:59
  • this runs ok, but I stopped it after 12 minutes. It seems it had filled up memory (I have 64 GB) and gone largely into swap – Rho Phi Aug 05 '20 at 15:40
  • well, you're evaluating strings as objects, so the memory usage will pile up with each iteration. Another option would be to save the JSON as a pure JSON file and read it line by line, letting Python handle the memory allocation. `Dask` or `PySpark` may be better for such a large dataframe. – Umar.H Aug 05 '20 at 15:44
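
For concreteness, the suggestion from these comments strings together roughly as follows (a sketch of the commenters' idea, not benchmarked; `df` and `label` stand in for the actual dataframe and JSON-string column):

    from ast import literal_eval
    import pandas as pd

    # parse each JSON string into a dict, then flatten the 'params' records
    parsed = df['label'].map(literal_eval)
    flat = pd.json_normalize(parsed, record_path=['params'])
    # record_path explodes the 'params' list, so each original row yields
    # four result rows (one per single-key dict), mostly NaN

That four-fold row explosion, on top of materialising 13M intermediate dicts, would be consistent with the memory pile-up reported in the last comments.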

0 Answers