
I'm trying to find all the attributes of the data in a nested dictionary in Python. Some objects have several levels of nesting in their keys. How can I find the header of such nested data (thinking of it as a table structure)? Here are a few lines of my data to show what it looks like:

{"MessageType": "SALES.HOLDCREATED", "Event": {"Id": "ZWbDoMKQw6HDjFzCo8KuwpNmwofCjl7Co8OPwpDCncOSXMOdccKTZVVWZWbCnA==", "RefInfo": {"TId": {"Id": "ZMKXwpbClsOhwpNiw5E="}, "UserId": {"Id": "wpzCksKWwpbCpMKTYsKeZMKZbA=="}, "SentUtc": "2013-04-28T16:59:48.6698042", "Source": 1}, "ItemId": {"Id": 116228}, "Quantity": 1, "ExpirationDate": "2013-04-29T", "Description": null}}
{"MessageType": "SALES.SALEITEMCREATED", "Event": {"Id": "ZWbDoMKQw6HDjFzCo8KuwpNmwofCjl7Co8OPwpDCncOSXMOdccKTwp3CiFZkZMKWwpfCpMKZ", "RefInfo": {"TId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-04T", "Source": 1}, "Code": {"Code": "074108235206"}, "Sku": {"Sku": "Con CS54"}}}
{"MessageType": "SALES.SALEITEMCREATED", "Event": {"Id": "ZWbDoMKQw6HDjFzCo8KuwpNmwofCjl7Co8OPwpDCncOSXMOdccKTZcKHVsKcwpjClsKXwqTCmQ==", "RefInfo": {"TId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-04T", "Source": 1}, "Code": {"Code": "4000000021"}, "Sku": {"Sku": "NFL-Wallet-MK-2201"}}}

Since this data is in JSON format, I first parsed it and tried to list the keys:

import json

data = []
with open("data.raw", "r") as f:
    for line in f:
        data.append(json.loads(line))

for lines in data:
    print(lines.keys())

but it gives me dict_keys(['Event', 'MessageType']) for every line. What I need (for the data shown above) is a list like:

'MessageType' 'Event_Id' 'Event_RefInfo_TId_Id' 'Event_RefInfo_UserId_Id' 'Event_RefInfo_SentUtc' 'Event_RefInfo_Source' 'Event_ItemId_Id' 'Event_Quantity' 'Event_ExpirationDate' ...

The data is very big, and I just need to find out which attributes it contains.

Moses Koledoye
Mina

1 Answer


You'll need to traverse the nested dicts; your current approach only gets as far as the keys of the root dictionary.

You can use the following generator function to find the keys and traverse nested dicts recursively:

import json 
from pprint import pprint

def find_keys(dct):
    for k, v in dct.items():
        if isinstance(v, dict):
            # traverse nested dict
            for x in find_keys(v):
                yield "{}_{}".format(k, x)
        else:
            yield k
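
As a quick check, here is a minimal sketch that applies the function to a single record. The sample dict is a hypothetical, shortened version of the first line of your data (the values are placeholders), and the function is repeated so the snippet runs on its own:

```python
def find_keys(dct):
    """Yield underscore-joined key paths for every non-dict value."""
    for k, v in dct.items():
        if isinstance(v, dict):
            # traverse nested dict and prefix each sub-path with the parent key
            for x in find_keys(v):
                yield "{}_{}".format(k, x)
        else:
            yield k

# Hypothetical, shortened record modelled on the question's data
sample = {
    "MessageType": "SALES.HOLDCREATED",
    "Event": {
        "Id": "abc",
        "RefInfo": {"TId": {"Id": "x"}, "Source": 1},
        "Quantity": 1,
    },
}

keys = list(find_keys(sample))
print(keys)
```

Note that only leaf values produce an entry, so intermediate keys such as Event or RefInfo appear only as prefixes.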

Given a list of dictionaries, such as the data list built from your file in the question, you can find the keys in each dict and put them in a set so entries are unique:

s = set()
for d in data:
    s.update(find_keys(d))

pprint(s)

set(['Event_Code_Code',
     'Event_Description',
     'Event_ExpirationDate',
     'Event_Id',
     'Event_ItemId_Id',
     'Event_Quantity',
     'Event_RefInfo_SentUtc',
     'Event_RefInfo_Source',
     'Event_RefInfo_TId_Id',
     'Event_RefInfo_UserId_Id',
     'Event_Sku_Sku',
     'MessageType'])
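
If the file is too large to hold in memory, the same set can probably be built without keeping the parsed records around, since only the key names need to survive. The following is a rough sketch, assuming one JSON object per line as in the question; collect_keys is a hypothetical helper name, and an in-memory io.StringIO stands in for the real file so the snippet runs on its own:

```python
import io
import json

def find_keys(dct):
    """Yield underscore-joined key paths for every non-dict value."""
    for k, v in dct.items():
        if isinstance(v, dict):
            for x in find_keys(v):
                yield "{}_{}".format(k, x)
        else:
            yield k

def collect_keys(lines):
    """Build the set of key paths from an iterable of JSON lines.

    Each line is parsed, scanned, and then discarded, so only the
    set of key names grows, not the data itself.
    """
    keys = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        keys.update(find_keys(json.loads(line)))
    return keys

# Stand-in for a real file: two small JSON lines
demo = io.StringIO('{"a": {"b": 1}}\n{"a": {"c": 2}, "d": null}\n')
result = collect_keys(demo)
print(sorted(result))
```

With a real file this would be used as `with open("data.raw") as f: s = collect_keys(f)`. Iterating over a file object yields one line at a time, so readlines() is not needed.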
Moses Koledoye
  • Thank you so much. This function worked perfectly. It is fine when the data is small enough to read into memory; the new problem arises when I want to process big data. – Mina Jul 03 '17 at 19:29
  • I have to use readlines() to build a list of strings, and although I set a buffering size when opening the file, it reads the whole file rather than just the buffer size. How can I read only the amount of data that I specify with buffering in the open function? – Mina Jul 03 '17 at 19:36