
So, I'm trying to parse this JSON object into multiple events, as that is the expected input for an ETL tool. I know this is fairly straightforward with loops, if statements, and explicitly defined search fields for the given events, but that approach is not feasible because I have multiple heavily nested JSON objects, and I would prefer to let Python recursion handle the heavy lifting. The following is a sample object, which consists of strings, lists and dicts (it covers most of the cases in my data).

{
"event_name": "restaurants",
"properties": {
    "_id": "5a9909384309cf90b5739342",
    "name": "Mangal Kebab Turkish Restaurant",
    "restaurant_id": "41009112",
    "borough": "Queens",
    "cuisine": "Turkish",
    "address": {
        "building": "4620",
        "coord": {
            "0": -73.9180155,
            "1": 40.7427742
        },
        "street": "Queens Boulevard",
        "zipcode": "11104"
    },
    "grades": [
         {
            "date": 1414540800000,
            "grade": "A",
            "score": 12
        },
        {
            "date": 1397692800000,
            "grade": "A",
            "score": 10
        },
        {
            "date": 1381276800000,
            "grade": "A",
            "score": 12
        }
    ]
}
}

And I want to convert it to the following list of dictionaries:

[
    {
        "event_name": "restaurants",
        "properties": {
            "restaurant_id": "41009112",
            "name": "Mangal Kebab Turkish Restaurant",
            "cuisine": "Turkish",
            "_id": "5a9909384309cf90b5739342",
            "borough": "Queens"
        }
    },
    {
        "event_name": "restaurant_address",
        "properties": {
            "zipcode": "11104",
            "ref_id": "41009112",
            "street": "Queens Boulevard",
            "building": "4620"
        }
    },
    {
        "event_name": "restaurant_address_coord",
        "properties": {
            "ref_id": "41009112",
            "0": -73.9180155,
            "1": 40.7427742
        }
    },
    {
        "event_name": "restaurant_grades",
        "properties": {
            "date": 1414540800000,
            "ref_id": "41009112",
            "score": 12,
            "grade": "A",
            "index": "0"
        }
    },
    {
        "event_name": "restaurant_grades",
        "properties": {
            "date": 1397692800000,
            "ref_id": "41009112",
            "score": 10,
            "grade": "A",
            "index": "1"
        }
    },
    {
        "event_name": "restaurant_grades",
        "properties": {
            "date": 1381276800000,
            "ref_id": "41009112",
            "score": 12,
            "grade": "A",
            "index": "2"
        }
    }
]

And most importantly, these events will be broken up into independent structured tables to conduct joins, so we need to create primary keys / unique identifiers. Each deeply nested dictionary should therefore carry its parent's id field as ref_id; in this case ref_id = restaurant_id from the parent dictionary.

Most of the examples on the internet flatten the whole object to normalize it into a dataframe, but to use this ETL tool to its full potential it would be ideal to solve this problem via recursion and output a list of dictionaries.

  • what have you tried so far, and where is the code or where are you stuck? The question needs sufficient code for a minimal reproducible example: https://stackoverflow.com/help/minimal-reproducible-example – D.L Aug 31 '22 at 06:44
  • So I'm using this as the basis, but I'm really confused about the approach I should take; even pseudocode would be much appreciated. https://stackoverflow.com/questions/58442723/how-to-flatten-a-nested-json-recursively-with-flatten-json – harish mohan Aug 31 '22 at 08:47
  • so, it seems there are two approaches: [1] flatten the data and then unflatten it into another structure, or [2] create another structure and move items between the two with rules? – D.L Aug 31 '22 at 09:13
  • Also, the return is a `list` with one element (a dict of dicts), like this `[{}]`, so is the `list` necessary? – D.L Aug 31 '22 at 09:19
  • yeah, flattening the data and then unflattening seems like a good option, but you would have to explicitly define the keys (and values) to group as a dict, right? And getting the parent _id keys will be a pain (maybe I'm missing something?). 2nd question: converting nested events to a list of events is necessary because the ETL tool expects a list, or else it treats the data as a property (with no event_name); it's just events = [], events.append(dict) in a loop, right? – harish mohan Aug 31 '22 at 09:48
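The flatten-then-unflatten idea from the comments above can be sketched roughly like this (a generic helper, not the full solution: regrouping the flat keys into separate events and injecting the parent ids would still need extra logic on top):

```python
def flatten(obj, parent_key="", sep="_"):
    """Flatten a nested dict into a single-level dict with path-style keys."""
    items = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # recurse into nested dicts, extending the key path
            items.update(flatten(value, new_key, sep))
        elif isinstance(value, list):
            # list items get their index appended to the key path
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    items.update(flatten(item, f"{new_key}{sep}{i}", sep))
                else:
                    items[f"{new_key}{sep}{i}"] = item
        else:
            items[new_key] = value
    return items
```

On the sample above this produces keys like `properties_address_coord_0` and `properties_grades_1_score`, which illustrates the point made in the comments: the parent `_id` bookkeeping and the regrouping into separate events do not come for free from flattening alone.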

1 Answer


This is what one might call a brute-force method: create a translator function that moves each item into the correct part of the new structure (like a schema).

# input dict
d = {
"event_name": "demo",
"properties": {
    "_id": "5a9909384309cf90b5739342",
    "name": "Mangal Kebab Turkish Restaurant",
    "restaurant_id": "41009112",
    "borough": "Queens",
    "cuisine": "Turkish",
    "address": {
        "building": "4620",
        "coord": {
            "0": -73.9180155,
            "1": 40.7427742
        },
        "street": "Queens Boulevard",
        "zipcode": "11104"
    },
    "grades": [
         {
            "date": 1414540800000,
            "grade": "A",
            "score": 12
        },
        {
            "date": 1397692800000,
            "grade": "A",
            "score": 10
        },
        {
            "date": 1381276800000,
            "grade": "A",
            "score": 12
        }
    ]
}
}


def convert_structure(d: dict):
    '''Translate the input dict into the new event structure.'''

    # the new dict
    e = {}
    e['event_name'] = d['event_name']
    e['properties'] = {}
    e['properties']['restaurant_id'] = d['properties']['restaurant_id']
    # and so forth...
    # keep building the new structure / template

    # return a list
    return [e]

# run & print
x = convert_structure(d)
print(x)

the result (for the part done so far) looks like this:

[{'event_name': 'demo', 'properties': {'restaurant_id': '41009112'}}]

If a pattern can be identified, then the above could be improved...
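One such pattern, sketched below under a few assumptions inferred from the expected output (the key that becomes `ref_id` is `restaurant_id`, child events are named after their parent's event name plus the dict key, and the plural top-level name is naively singularised with `rstrip('s')` to get the child prefix — none of this is confirmed by the question):

```python
def to_events(obj, event_name, child_prefix=None, ref_id=None,
              index=None, id_key="restaurant_id"):
    """Recursively split a nested dict into a flat list of events.

    Scalar values stay in the current event's properties; nested dicts
    and lists of dicts become child events named <prefix>_<key> that
    carry the nearest ancestor's id_key value as ref_id.
    """
    child_prefix = child_prefix or event_name
    props, children = {}, []
    # the id that child events of this level should reference
    child_ref = obj.get(id_key, ref_id)
    for key, value in obj.items():
        if isinstance(value, dict):
            children += to_events(value, f"{child_prefix}_{key}",
                                  ref_id=child_ref, id_key=id_key)
        elif isinstance(value, list) and value and all(isinstance(i, dict) for i in value):
            for i, item in enumerate(value):
                children += to_events(item, f"{child_prefix}_{key}",
                                      ref_id=child_ref, index=str(i),
                                      id_key=id_key)
        else:
            props[key] = value
    if ref_id is not None:
        props["ref_id"] = ref_id
    if index is not None:
        props["index"] = index
    return [{"event_name": event_name, "properties": props}] + children


# trimmed sample input in the shape used above
d = {
    "event_name": "restaurants",
    "properties": {
        "restaurant_id": "41009112",
        "name": "Mangal Kebab Turkish Restaurant",
        "address": {
            "building": "4620",
            "coord": {"0": -73.9180155, "1": 40.7427742},
            "street": "Queens Boulevard",
            "zipcode": "11104",
        },
        "grades": [
            {"date": 1414540800000, "grade": "A", "score": 12},
            {"date": 1397692800000, "grade": "A", "score": 10},
        ],
    },
}

# naive singularisation: "restaurants" -> "restaurant" for child prefixes
events = to_events(d["properties"], d["event_name"],
                   child_prefix=d["event_name"].rstrip("s"))
```

The events come out depth-first, parent before children, so the root `restaurants` event is followed by `restaurant_address`, `restaurant_address_coord`, and one `restaurant_grades` event per list item; how closely this matches the tool's requirements would need checking against real documents.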

D.L
  • Yeah, I'm doing this currently, but it's not going to scale for my problem. This is just one level of nesting; I have documents that are 10-15 levels deep. I'm just trying to break it down to the smallest problem statement, lol. With thousands of attributes I can't create a schema manually, unfortunately. – harish mohan Aug 31 '22 at 10:12
  • @harishmohan: I would concur, but there must be a logic for transferring items between the two... even if the data is flattened, without a `translation logic` the result could not be achieved (unless there is a systematic unpacking method). – D.L Aug 31 '22 at 10:21
  • Yeah, there must be a way for sure... I found this article: they create a unique address for each key/value location and build a schema. AutoFlatten is Python; they use that class on multiple objects via PySpark for parallel execution. Ignoring the PySpark aspect, it looks promising, I think... https://link.medium.com/gaoAhQJgWsb – harish mohan Aug 31 '22 at 10:39