
I have a huge CSV file (3.5GB and growing every day) with normal columns plus one column called 'Metadata' that holds nested JSON values. My script is below; the intention is simply to expand the JSON column into normal columns, one for each of its key-value pairs. I am using Python 3 (Anaconda; Windows).

import pandas as pd
import numpy as np
import csv
import json  # needed for the json.loads converter below
import datetime as dt

from pandas.io.json import json_normalize

for df in pd.read_csv("source.csv", engine='c', 
    dayfirst=True, 
    encoding='utf-8', 
    header=0,
    nrows=10,
    chunksize=2,
    converters={'Metadata':json.loads}):

    ## parsing code comes here

    with open("output.csv", 'a', encoding='utf-8') as ofile:
        df.to_csv(ofile, index=False, encoding='utf-8')

And the column has JSON in the following format:

{  
   "content_id":"xxxx",
   "parental":"F",
   "my_custom_data":{  
      "GroupId":"NA",
      "group":null,
      "userGuid":"xxxxxxxxxxxxxx",
      "deviceGuid":"xxxxxxxxxxxxx",
      "connType":"WIFI",
      "channelName":"VOD",
      "assetId":"xxxxxxxxxxxxx",
      "GroupName":"NA",
      "playType":"VOD",
      "appVersion":"2.1.0",
      "userEnvironmentContext":"",
      "vodEncode":"H.264",
      "language":"English"
   }
}

The desired output is to have all the above key-value pairs as columns. The dataframe will have other non-JSON columns to which I need to add the columns parsed from the above JSON. I tried json_normalize but I am not sure how to apply json_normalize to a Series object and then convert it (or explode it) into multiple columns.

Martijn Pieters
A.Ali

1 Answer


Just use json_normalize() on the series directly, then use pandas.concat() to merge the new dataframe with the existing dataframe; pass axis=1 so the new columns are placed alongside the originals rather than appended as rows:

pd.concat([df, json_normalize(df['Metadata'])], axis=1)

You can add a .drop('Metadata', axis=1) if you no longer need the old column with the JSON datastructure in it.
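A minimal, self-contained sketch of that pattern on toy data (the values are placeholders, not the real data; newer pandas versions expose the function at the top level as pd.json_normalize):

```python
import pandas as pd

# Toy chunk standing in for one piece of the real CSV (hypothetical values).
df = pd.DataFrame({
    'id': [1, 2],
    'Metadata': [
        {'content_id': 'a', 'my_custom_data': {'language': 'English'}},
        {'content_id': 'b', 'my_custom_data': {'language': 'French'}},
    ],
})

# Flatten the dicts into their own dataframe of columns.
flat = pd.json_normalize(df['Metadata'])

# concat aligns on the index; with chunked reads each chunk's index does not
# start at 0, so re-align the flattened frame to the chunk first.
flat.index = df.index

result = pd.concat([df.drop('Metadata', axis=1), flat], axis=1)
print(result.columns.tolist())
# ['id', 'content_id', 'my_custom_data.language']
```

The index re-alignment is the easy thing to miss here: concat(axis=1) matches rows by index label, not by position, so a chunk whose index runs 4..5 against a flattened frame indexed 0..1 would produce NaN-padded rows instead of a clean merge.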

The columns produced for the my_custom_data nested dictionary will be prefixed with my_custom_data. (note the trailing dot). If all the names in that nested dictionary are unique, you could drop that prefix with a DataFrame.rename() operation:

json_normalize(df['Metadata']).rename(
    columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)
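A quick sanity check of the prefix stripping on a toy record (hypothetical values; 15 is the length of the string 'my_custom_data.'):

```python
import pandas as pd

# One metadata record mirroring the structure from the question.
meta = pd.Series([
    {'content_id': 'x', 'parental': 'F',
     'my_custom_data': {'language': 'English', 'connType': 'WIFI'}},
])

# Flatten, then strip the 15-character 'my_custom_data.' prefix.
flat = pd.json_normalize(meta).rename(
    columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)
print(flat.columns.tolist())
# ['content_id', 'parental', 'language', 'connType']
```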

If you are using some other means to convert each dictionary value to a flattened structure (say, with flatten_json), then you want to use Series.apply() to process each value and return each resulting dictionary as a pandas.Series() object:

def some_conversion_function(dictionary):
    result = something_that_processes_dictionary_into_a_flat_dict(dictionary)
    return pd.Series(result)

You can then concatenate the result of the Series.apply() call (which will be a dataframe) back onto your original dataframe, again column-wise with axis=1:

pd.concat([df, df['Metadata'].apply(some_conversion_function)], axis=1)
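Sketching that route end to end, with a hand-rolled flatten standing in for flatten_json.flatten (an assumption for the demo; the real library joins nested key paths with underscores in much the same way):

```python
import pandas as pd

def flatten(d, prefix=''):
    # Minimal stand-in for flatten_json.flatten: walk nested dicts and
    # join key paths with underscores.
    out = {}
    for key, value in d.items():
        if isinstance(value, dict):
            out.update(flatten(value, f'{prefix}{key}_'))
        else:
            out[f'{prefix}{key}'] = value
    return out

# Toy chunk with one row (hypothetical values).
df = pd.DataFrame({
    'id': [1],
    'Metadata': [{'parental': 'F', 'my_custom_data': {'connType': 'WIFI'}}],
})

# Each flattened dict becomes a pd.Series, so apply() yields a dataframe;
# axis=1 then bolts those columns onto the original chunk.
expanded = df['Metadata'].apply(lambda d: pd.Series(flatten(d)))
result = pd.concat([df.drop('Metadata', axis=1), expanded], axis=1)
print(result.columns.tolist())
# ['id', 'parental', 'my_custom_data_connType']
```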
Martijn Pieters
  • Thanks for the reply... the reason I'm looping over it is that read_csv returns an Iterator when you use chunksize=xxx.... Meanwhile, I am also getting a KeyError: 'group' from json_normalize. – A.Ali Jun 22 '18 at 14:37
  • @A.Ali: ah, my mistake, I missed the chunksize argument there. – Martijn Pieters Jun 22 '18 at 14:39
  • Apparently, json_normalize doesn't deal well with my json. It gives me a KeyError: 'group'. I'm having better success using the flatten_json library. Now I have a python dict and need to create a dataframe using a dict and then append it to the existing dataframe. – A.Ali Jun 22 '18 at 15:01
  • @A.Ali: if you have a flat dictionary (all values are scalars), just create a `Series()` directly from that, using `df[columnname].apply(pd.Series)`; this returns a new dataframe, use `pd.concat()` just like with `json_normalize()`. – Martijn Pieters Jun 22 '18 at 15:15
  • I went for this solution: def xflatten(js): d = flatten_json.flatten(js) for key, value in d.items(): d[key] = [d[key]] return pd.DataFrame.from_dict(d, orient='columns') .... AND ... newdf = pd.concat([df.drop(['Metadata'], axis=1), df.Metadata.apply(xflatten)], axis=1) ... but it prints the column in a "pretty" format in my csv, lol – A.Ali Jun 22 '18 at 15:28
  • @A.Ali: why did you return a dataframe? Just return `pd.Series(d)`, without wrapping each value in a list. – Martijn Pieters Jun 22 '18 at 15:43
    `def xflatten(js): return pd.Series(flatten_json.flatten(js))`, then `pd.concat([df.drop(['Metadata'], axis=1), df.Metadata.apply(xflatten)])` – Martijn Pieters Jun 22 '18 at 15:47
  • Got it! It works!! Please edit your answer so that I can mark it as the final one! Major thanks! – A.Ali Jun 22 '18 at 16:02