
So, I'm trying to parse this JSON object into multiple events, as that is the expected input for an ETL tool. I know this is fairly straightforward with loops, if statements, and explicitly defined search fields for the given events, but that approach is not feasible because I have multiple heavily nested JSON objects, and I would prefer to let Python recursion handle the heavy lifting. The following is a sample object, which consists of strings, lists and dicts (it covers most of the cases in my data).

{
"event_name": "restaurants",
"properties": {
    "_id": "5a9909384309cf90b5739342",
    "name": "Mangal Kebab Turkish Restaurant",
    "restaurant_id": "41009112",
    "borough": "Queens",
    "cuisine": "Turkish",
    "address": {
        "building": "4620",
        "coord": {
            "0": -73.9180155,
            "1": 40.7427742
        },
        "street": "Queens Boulevard",
        "zipcode": "11104"
    },
    "grades": [
         {
            "date": 1414540800000,
            "grade": "A",
            "score": 12
        },
        {
            "date": 1397692800000,
            "grade": "A",
            "score": 10
        },
        {
            "date": 1381276800000,
            "grade": "A",
            "score": 12
        }
    ]
}
}

And I want to convert it to the following list of dictionaries:

[
    {
        "event_name": "restaurants",
        "properties": {
            "restaurant_id": "41009112",
            "name": "Mangal Kebab Turkish Restaurant",
            "cuisine": "Turkish",
            "_id": "5a9909384309cf90b5739342",
            "borough": "Queens"
        }
    },
    {
        "event_name": "restaurant_address",
        "properties": {
            "zipcode": "11104",
            "ref_id": "41009112",
            "street": "Queens Boulevard",
            "building": "4620"
        }
    },
    {
        "event_name": "restaurant_address_coord",
        "properties": {
            "ref_id": "41009112",
            "0": -73.9180155,
            "1": 40.7427742
        }
    },
    {
        "event_name": "restaurant_grades",
        "properties": {
            "date": 1414540800000,
            "ref_id": "41009112",
            "score": 12,
            "grade": "A",
            "index": "0"
        }
    },
    {
        "event_name": "restaurant_grades",
        "properties": {
            "date": 1397692800000,
            "ref_id": "41009112",
            "score": 10,
            "grade": "A",
            "index": "1"
        }
    },
    {
        "event_name": "restaurant_grades",
        "properties": {
            "date": 1381276800000,
            "ref_id": "41009112",
            "score": 12,
            "grade": "A",
            "index": "2"
        }
    }
]

And most importantly, these events will be broken up into independent structured tables to conduct joins, so we need to create primary keys / unique identifiers. Each deeply nested dictionary should therefore carry its parent's id field as ref_id; in this case ref_id = restaurant_id from the parent dictionary.

Most of the examples on the internet flatten the whole object to normalize it into a dataframe, but to use this ETL tool to its full potential it would be ideal to solve this problem via recursion and output a list of dictionaries.

  • what have you tried so far, and where is the code or where are you stuck? The question needs sufficient code for a minimal reproducible example: https://stackoverflow.com/help/minimal-reproducible-example – D.L Aug 31 '22 at 06:44
  • So I'm using this as the basis, but I'm really confused about the approach I should take; even pseudocode would be much appreciated. https://stackoverflow.com/questions/58442723/how-to-flatten-a-nested-json-recursively-with-flatten-json – harish mohan Aug 31 '22 at 08:47
  • so, it seems there are two approaches: [1] flatten the data and then unflatten it into another structure, or [2] create another structure and move items between the two with rules? – D.L Aug 31 '22 at 09:13
  • Also, the return is a `list` with one element (a dict of dicts), like this `[{}]`, so is the `list` necessary? – D.L Aug 31 '22 at 09:19
  • yeah, flattening the data and then unflattening seems like a good option, but you would have to explicitly define the keys (and values) to group as a dict, right? And getting the parent _id keys will be a pain (maybe I'm missing something?). 2nd question: converting nested events to a list of events is necessary because the ETL tool expects a list, or else it treats the data as a property (with no event_name); it's just events = [], events.append(dict) in a loop, right? – harish mohan Aug 31 '22 at 09:48
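The flatten-then-unflatten idea from the comments above can be sketched roughly like this (a generic helper, not the full solution: regrouping the flat keys into separate events and injecting the parent ids would still need extra logic on top):

```python
def flatten(obj, parent_key="", sep="_"):
    """Flatten a nested dict into a single-level dict with path-style keys."""
    items = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # recurse into nested dicts, extending the key path
            items.update(flatten(value, new_key, sep))
        elif isinstance(value, list):
            # list items get their index appended to the key path
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    items.update(flatten(item, f"{new_key}{sep}{i}", sep))
                else:
                    items[f"{new_key}{sep}{i}"] = item
        else:
            items[new_key] = value
    return items
```

On the sample above this produces keys like `properties_address_coord_0` and `properties_grades_1_score`, which illustrates the point made in the comments: the parent `_id` bookkeeping and the regrouping into separate events do not come for free from flattening alone.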

1 Answer


This is what one might call a brute-force method: create a translator function that moves each item into the correct part of the new structure (like a schema).

# input dict
d = {
"event_name": "demo",
"properties": {
    "_id": "5a9909384309cf90b5739342",
    "name": "Mangal Kebab Turkish Restaurant",
    "restaurant_id": "41009112",
    "borough": "Queens",
    "cuisine": "Turkish",
    "address": {
        "building": "4620",
        "coord": {
            "0": -73.9180155,
            "1": 40.7427742
        },
        "street": "Queens Boulevard",
        "zipcode": "11104"
    },
    "grades": [
         {
            "date": 1414540800000,
            "grade": "A",
            "score": 12
        },
        {
            "date": 1397692800000,
            "grade": "A",
            "score": 10
        },
        {
            "date": 1381276800000,
            "grade": "A",
            "score": 12
        }
    ]
}
}


def convert_structure(d: dict):
    '''Translate the input dict into the new event structure.'''

    # the new dict
    e = {}
    e['event_name'] = d['event_name']
    e['properties'] = {}
    e['properties']['restaurant_id'] = d['properties']['restaurant_id']
    # and so forth...
    # keep building the new structure / template

    # return a list
    return [e]

# run & print
x = convert_structure(d)
print(x)

the result (for the part done so far) looks like this:

[{'event_name': 'demo', 'properties': {'restaurant_id': '41009112'}}]

If a pattern can be identified, then the above could be improved...
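One such pattern, sketched below under a few assumptions inferred from the expected output (the key that becomes `ref_id` is `restaurant_id`, child events are named after their parent's event name plus the dict key, and the plural top-level name is naively singularised with `rstrip('s')` to get the child prefix — none of this is confirmed by the question):

```python
def to_events(obj, event_name, child_prefix=None, ref_id=None,
              index=None, id_key="restaurant_id"):
    """Recursively split a nested dict into a flat list of events.

    Scalar values stay in the current event's properties; nested dicts
    and lists of dicts become child events named <prefix>_<key> that
    carry the nearest ancestor's id_key value as ref_id.
    """
    child_prefix = child_prefix or event_name
    props, children = {}, []
    # the id that child events of this level should reference
    child_ref = obj.get(id_key, ref_id)
    for key, value in obj.items():
        if isinstance(value, dict):
            children += to_events(value, f"{child_prefix}_{key}",
                                  ref_id=child_ref, id_key=id_key)
        elif isinstance(value, list) and value and all(isinstance(i, dict) for i in value):
            for i, item in enumerate(value):
                children += to_events(item, f"{child_prefix}_{key}",
                                      ref_id=child_ref, index=str(i),
                                      id_key=id_key)
        else:
            props[key] = value
    if ref_id is not None:
        props["ref_id"] = ref_id
    if index is not None:
        props["index"] = index
    return [{"event_name": event_name, "properties": props}] + children


# trimmed sample input in the shape used above
d = {
    "event_name": "restaurants",
    "properties": {
        "restaurant_id": "41009112",
        "name": "Mangal Kebab Turkish Restaurant",
        "address": {
            "building": "4620",
            "coord": {"0": -73.9180155, "1": 40.7427742},
            "street": "Queens Boulevard",
            "zipcode": "11104",
        },
        "grades": [
            {"date": 1414540800000, "grade": "A", "score": 12},
            {"date": 1397692800000, "grade": "A", "score": 10},
        ],
    },
}

# naive singularisation: "restaurants" -> "restaurant" for child prefixes
events = to_events(d["properties"], d["event_name"],
                   child_prefix=d["event_name"].rstrip("s"))
```

The events come out depth-first, parent before children, so the root `restaurants` event is followed by `restaurant_address`, `restaurant_address_coord`, and one `restaurant_grades` event per list item; how closely this matches the tool's requirements would need checking against real documents.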

D.L
  • Yeah, I'm doing this currently, but it's not going to scale for my problem. This is just one level of nesting; I have documents that are 10-15 levels deep. I'm just trying to break it down to the smallest problem statement, lol. With thousands of attributes I can't create a schema manually, unfortunately. – harish mohan Aug 31 '22 at 10:12
  • @harishmohan: I would concur, but there must be a logic for transferring items between the two... even if the data is flattened, without a `translation logic` the result could not be achieved (unless there is a systematic unpacking method). – D.L Aug 31 '22 at 10:21
  • Yeah, there must be a way for sure... I found this article: they create a unique address for each key/value location and build a schema. AutoFlatten is Python; they use that class on multiple objects via PySpark for parallel execution. Ignoring the PySpark aspect, it looks promising, I think... https://link.medium.com/gaoAhQJgWsb – harish mohan Aug 31 '22 at 10:39