I have a pandas dataframe that stores a JSON object in one column. I want to process that JSON object to build a new dataframe (with a different number and type of columns, where each original row generates n new rows from its JSON object). I wrote the logic below, which appends a dictionary (one new row) to a list while iterating over the original dataframe.
import pandas as pd

data = []

def process_row_data(row):
    global data
    for item in row.json_object['obj']:
        # create a dictionary to represent each row of the new dataframe
        parsed_row = {'a': item.a, 'b': item.b, ..... 'zyx': item.zyx}
        data.append(parsed_row)

df.apply(lambda row: process_row_data(row), axis=1)

# create the new dataframe
df_final = pd.DataFrame.from_dict(data)
However, this approach doesn't scale as the number of rows and the size of each parsed_row grow.
Is there a way to write this in a scalable way with PySpark?
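For reference, this is roughly the kind of structure I'm picturing, though I'm not sure it's the right approach. It assumes json_object is stored as a JSON string whose obj key holds an array of records; the schema and field names below are just placeholders for my real columns:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# placeholder schema for the array stored under the "obj" key of the JSON column
item_schema = StructType([
    StructField("a", StringType()),
    StructField("b", StringType()),
    # ... one field per key of parsed_row
])
json_schema = StructType([StructField("obj", ArrayType(item_schema))])

# load the source data into Spark (here converted from the existing pandas df)
sdf = spark.createDataFrame(df)

df_final = (
    sdf
    # parse the JSON string column into a struct using the schema above
    .withColumn("parsed", F.from_json(F.col("json_object"), json_schema))
    # explode the array so each element becomes its own row
    .withColumn("item", F.explode(F.col("parsed.obj")))
    # pull the struct fields out as top-level columns
    .select("item.a", "item.b")  # ... one column per field
)

Is something along these lines the right direction, or is there a better pattern for this?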