I have a dataframe in PySpark with 3 columns - json, date and object_id:
-----------------------------------------------------------------------------------------
|json                                                              |date      |object_id|
-----------------------------------------------------------------------------------------
|{'a':{'b':0,'c':{'50':0.005,'60':0,'100':0},'d':0.01,'e':0,'f':2}}|2020-08-01|xyz123   |
|{'a':{'m':0,'n':{'50':0.005,'60':0,'100':0},'d':0.01,'e':0,'f':2}}|2020-08-02|xyz123   |
|{'g':{'h':0,'j':{'50':0.005,'80':0,'100':0},'d':0.02}}            |2020-08-03|xyz123   |
-----------------------------------------------------------------------------------------
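For reproducibility, the sample data can be built like this (a minimal sketch; I have rewritten the json values with double quotes so they are valid JSON strings):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# json is kept as a raw string column because its structure
# varies from row to row.
rows = [
    ('{"a":{"b":0,"c":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}',
     '2020-08-01', 'xyz123'),
    ('{"a":{"m":0,"n":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}',
     '2020-08-02', 'xyz123'),
    ('{"g":{"h":0,"j":{"50":0.005,"80":0,"100":0},"d":0.02}}',
     '2020-08-03', 'xyz123'),
]
df = spark.createDataFrame(rows, ['json', 'date', 'object_id'])
```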
Now I have a list of variables: [a.c.60, a.n.60, a.d, g.h]. I need to extract only these variables from the json column of the above dataframe and add them as columns with their respective values.
So in the end, the dataframe should look like:
-------------------------------------------------------------------------------------------------------
|json                                                    |date      |object_id|a.c.60|a.n.60|a.d |g.h |
-------------------------------------------------------------------------------------------------------
|{'a':{'b':0,'c':{'50':0.005,'60':0,'100':0},'d':0.01,...|2020-08-01|xyz123   |0     |null  |0.01|null|
|{'a':{'m':0,'n':{'50':0.005,'60':0,'100':0},'d':0.01,...|2020-08-02|xyz123   |null  |0     |0.01|null|
|{'g':{'h':0,'j':{'50':0.005,'80':0,'100':0},'d':0.02}}  |2020-08-03|xyz123   |null  |null  |null|0   |
-------------------------------------------------------------------------------------------------------
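Conceptually, what I need per row is the following lookup, shown here in plain Python just to pin down the semantics (`lookup` is an illustrative helper, not part of my code):

```python
import json

def lookup(d, path):
    # Walk the nested dict along the dot-separated path;
    # return None as soon as any key on the path is missing.
    for key in path.split('.'):
        if not isinstance(d, dict) or key not in d:
            return None
        d = d[key]
    return d

row = json.loads('{"a":{"b":0,"c":{"50":0.005,"60":0,"100":0},"d":0.01,"e":0,"f":2}}')
lookup(row, 'a.c.60')  # 0
lookup(row, 'g.h')     # None -> null in the result column
```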
How can I get this result dataframe? The main problem I am facing is that the incoming JSON has no fixed structure: it can be nested in any form, but I only ever need to extract the four variables listed above. I have achieved this in Pandas by flattening the JSON string and then extracting the four variables, but in Spark it is proving difficult.
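For reference, this is roughly what my Pandas version looks like (a sketch, not my exact code; it assumes `pandas.json_normalize` for the flattening and that the strings parse as JSON; `pdf` is the same data as a pandas DataFrame):

```python
import json
import pandas as pd

variables = ['a.c.60', 'a.n.60', 'a.d', 'g.h']

# json_normalize flattens each parsed dict into dot-separated columns
# ('a.b', 'a.c.50', 'a.c.60', ...), which matches the variable paths.
flat = pd.json_normalize(pdf['json'].apply(json.loads))

# Keep only the requested paths; paths absent in a row come out as NaN.
result = pd.concat([pdf, flat.reindex(columns=variables)], axis=1)
```

I am looking for an equivalent of this in Spark.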