3

I have a DataFrame in Python that looks like the one below. Theres a column (referred to below as 'json') that has a large nested JSON string in it. How can I parse it out so I can have a nice clean data frame with many columns. just particularly need the cost and monthly amounts for each ID in separate columns. Ideally i'd have a table that looks like:

id, name, cost, monthly

10001, frank, 15.85, 15.85

10002, mary, 30.86, 23.03

    d = {'id': ['10001', '10002'], 'json': ['{"costs":[{"cost":15.85}],"policies":[{"logo":"HLIF-transparent-inhouse.png","monthly":15.85,"rating":"A++","waiverOfPremium":1.74,"carrier":"companyabc","face":250000,"term":20,"newFace":null,"newMonthly":null,"isCompanyD":true,"carrierCode":"xyz","product":"XYZt"}],"agentSuggestion":{"costs":[{"cost":15.85}],"options":{"product":"XYZt","gender":"male","healthClass":"0","smoker":"false","age":32,"term":"20","faceAmount":250000,"waiverOfPremiumAmount":1.74,"includeWaiverOfPremium":false,"state":"CT"},"policies":[{"logo":"HLIF-transparent-inhouse.png","monthly":15.85,"rating":"A++","waiverOfPremium":1.74,"carrier":"companyabc","face":250000,"term":20,"newFace":null,"newMonthly":null,"isCompanyD":true,"carrierCode":"xyz","product":"XYZt"}]}}', '{"costs":[{"cost":30.86}],"policies":[{"logo":"HLIF-transparent-inhouse.png","monthly":23.03,"rating":"A++","waiverOfPremium":7.83,"carrier":"companyabc","face":1000000,"term":10,"newFace":null,"newMonthly":null,"isCompanyD":true,"carrierCode":"xyz","product":"XYZt"}],"agentSuggestion":{"costs":[{"cost":30.86}],"options":{"product":"XYZt","gender":"female","healthClass":"0","smoker":"false","age":35,"term":10,"faceAmount":1000000,"waiverOfPremiumAmount":7.83,"includeWaiverOfPremium":true,"state":"GA"},"policies":[{"logo":"HLIF-transparent-inhouse.png","monthly":23.03,"rating":"A++","waiverOfPremium":7.83,"carrier":"companyabc","face":1000000,"term":10,"newFace":null,"newMonthly":null,"isCompanyD":true,"carrierCode":"xyz","product":"XYZt"}]}}'], 'name':['frank','mary']}

   test = pd.DataFrame(data=d)
PokerFace
  • 55
  • 1
  • 4

2 Answers2

4

There you go. You have 2 different costs in your JSON (cost and agentSuggestion cost), so both are added here:

import json
test = pd.DataFrame(d, columns = ['id', 'json', 'name'])
test['cost'] = test['json'].transform(lambda x: json.loads(x)['costs'][0]['cost'])
test['agent_suggestion_cost'] = test['json']\
    .transform(lambda x: json.loads(x)['agentSuggestion']["costs"][0]['cost'])
print(test)

You can follow similar logic for parsing other fields like monthly. For more reference see e.g.here Look around for a JSON prettifier (e.g. JSTool with Notepad++) to view your JSON's structure, this will help to understand its structure.

Kindly accept the answer if you found it useful.

Turo
  • 1,537
  • 2
  • 21
  • 42
0

Pandas offers a couple of utilities for dealing with json files. The ones that make sense for your case are pd.read_json and pd.io.json_normalize. They do however expect input in a different json format than yours.

orient : string,

Indication of expected JSON string format. Compatible JSON strings can be produced by to_json() with a corresponding orient value. The set of possible orients is:

'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
'records' : list like [{column -> value}, ... , {column -> value}]
'index' : dict like {index -> {column -> value}}
'columns' : dict like {column -> {index -> value}}
'values' : just the values array
The allowed and default values depend on the value of the typ parameter.

when typ == 'series',
allowed orients are {'split','records','index'}
default is 'index'
The Series index must be unique for orient 'index'.
when typ == 'frame',
allowed orients are {'split','records','index', 'columns','values'}
default is 'columns'
The DataFrame index must be unique for orients 'index' and 'columns'.
The DataFrame columns must be unique for orients 'index', 'columns', and 'records'.
tobsecret
  • 2,442
  • 15
  • 26