I have a huge CSV file (3.5GB and growing every day) that has normal columns plus one column called 'Metadata' containing nested JSON values. My script is below, and the intention is simply to convert the JSON column into ordinary columns, one for each of its key-value pairs. I am using Python 3 (Anaconda; Windows).
import pandas as pd
import numpy as np
import csv
import datetime as dt
import json  # needed for json.loads in the converter below
from pandas.io.json import json_normalize

for df in pd.read_csv("source.csv", engine='c',
                      dayfirst=True,
                      encoding='utf-8',
                      header=0,
                      nrows=10,
                      chunksize=2,
                      converters={'Metadata': json.loads}):
    ## parsing code comes here
    with open("output.csv", 'a', encoding='utf-8') as ofile:
        df.to_csv(ofile, index=False, encoding='utf-8')
The column contains JSON in the following format:
{
    "content_id": "xxxx",
    "parental": "F",
    "my_custom_data": {
        "GroupId": "NA",
        "group": null,
        "userGuid": "xxxxxxxxxxxxxx",
        "deviceGuid": "xxxxxxxxxxxxx",
        "connType": "WIFI",
        "channelName": "VOD",
        "assetId": "xxxxxxxxxxxxx",
        "GroupName": "NA",
        "playType": "VOD",
        "appVersion": "2.1.0",
        "userEnvironmentContext": "",
        "vodEncode": "H.264",
        "language": "English"
    }
}
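For reference, this is what flattening one such record looks like: json_normalize turns nested keys into dot-separated column names (a quick check with a trimmed-down dict; in recent pandas the function is available as pd.json_normalize, which I assume behaves the same as the import above):

```python
import pandas as pd

# Trimmed-down example record (keys taken from the sample above)
record = {
    "content_id": "xxxx",
    "parental": "F",
    "my_custom_data": {"GroupId": "NA", "language": "English"},
}

# Nested keys come out as 'my_custom_data.GroupId', 'my_custom_data.language'
flat = pd.json_normalize(record)
print(list(flat.columns))
```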
The desired output is to have all the above key-value pairs as columns, added alongside the dataframe's other non-JSON columns. I tried json_normalize, but I am not sure how to apply it to a Series object and then explode the result into multiple columns.
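In case it clarifies the goal, this is the kind of per-chunk transformation I am trying to achieve, written with toy data ('id' is a placeholder for my real non-JSON columns, and pd.json_normalize is my guess at the right call):

```python
import pandas as pd

# Stand-in for one chunk: 'Metadata' already holds parsed dicts,
# as produced by converters={'Metadata': json.loads}
df = pd.DataFrame({
    "id": [1, 2],
    "Metadata": [
        {"content_id": "a", "my_custom_data": {"language": "English"}},
        {"content_id": "b", "my_custom_data": {"language": "French"}},
    ],
})

# Flatten the dicts into a frame of new columns...
meta = pd.json_normalize(df["Metadata"].tolist())

# ...then attach them back to the chunk (in the real chunked loop the
# chunk index would not start at 0, so reset_index(drop=True) first)
out = df.drop(columns=["Metadata"]).reset_index(drop=True).join(meta)
print(out.columns.tolist())
```

I am unsure whether this is the idiomatic way, or whether it will hold up on a 3.5GB file processed in chunks.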