7

Let's say that we have the following JSON file. For the sake of the example it's emulated by a string. The string is the input and a Tree object should be the output. I'll be using the graphical notation of a tree to present the output.

I've found the following classes to handle tree concept in Python:

class TreeNode(object):
    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, obj):
        self.children.append(obj)

    def __str__(self, level=0):
        ret = "\t"*level+repr(self.data)+"\n"
        for child in self.children:
            ret += child.__str__(level+1)
        return ret

    def __repr__(self):
        return '<tree node representation>'

class Tree:
    def __init__(self):
        self.root = TreeNode('ROOT')

    def __str__(self):
        return self.root.__str__()

The input file can be of different complexity:

Simple case

Input:

json_file = '{"item1": "end1", "item2": "end2"}'

Output:

"ROOT"
    item1
        end1
    item2
        end2

Embedded case

Input:

json_file = {"item1": "end1", "item2": {"item3": "end3"}}

Output:

"ROOT"
    item1
        end1
    item2
        item3
            end3

Array case

Input:

json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] }

Output:

"ROOT"
    name
        John
    items
        1
            item_name
                lettuce
            price
                2.65
            units
                no
        2   
            item_name
                ketchup
            price
                1.51
            units
                litres

Please note that each item in an array is described with an integer (starting at 1).

So far I've managed to come up with the following function that solves the problem for the simple case. In terms of the embedded case I know that I must use recursion but so far I get UnboundLocalError: local variable 'tree' referenced before assignment.

def create_tree_from_JSON(json, parent=None):
    if not parent:
        tree = Tree()
        node_0 = TreeNode("ROOT")
        tree.root = node_0
        parent = node_0
    else:
        parent = parent

    for key in json:
        if isinstance(json[key], dict):
            head = TreeNode(key)
            create_tree_from_JSON(json[key], head)
        else:
            node = TreeNode(key)
            node.add_child(TreeNode(json[key]))
            parent.add_child(node)

    return tree

Problem's background

You may wonder why would I need to change a JSON object into a tree. As you may know PostgreSQL provides a way to handle JSON fields in the database. Given a JSON object I can get the value of any field by using -> and ->> notation. Here and here more about the subject. I will be creating new tables based on the fields' names and values. Unfortunately the JSON objects vary to such an extent that I cannot write the .sql code manually - I must find a way to do it automatically.

Let's assume that I want to create a table based on the embedded case. I need to get the following .sql code:

select 
    content_json ->> 'item1' as end1,
    content_json -> 'item_2' ->> 'item_3' as end3
from table_with_json

Substitute content_json for "ROOT" and you can see that each line in SQL code is simply a depth-first traversal from "ROOT" to a leaf (move from the last node to leaf is always annotated with ->>).

EDIT: In order to make the question more clear I'm adding the target .sql query for the array case. I would like there to be as many queries as there are elements in the array:

select
    content_json ->> 'name' as name,
    content_json -> 'items' -> 1 -> 'item_name' as item_name,
    content_json -> 'items' -> 1 -> 'price' as price,
    content_json -> 'items' -> 1 -> 'units' as units
from table_with_json

select
    content_json ->> 'name' as name,
    content_json -> 'items' -> 2 ->> 'item_name' as item_name,
    content_json -> 'items' -> 2 ->> 'price' as price,
    content_json -> 'items' -> 2 ->> 'units' as units
from table_with_json

Solution so far (07.05.2019)

I'm testing the current solution for the moment:

from collections import OrderedDict

def treeify(data) -> dict:
    if isinstance(data, dict):  # already have keys, just recurse
        return OrderedDict((key, treeify(children)) for key, children in data.items())
    elif isinstance(data, list):  # make keys from indices
        return OrderedDict((idx, treeify(children)) for idx, children in enumerate(data, start=1))
    else:  # leave node, no recursion
        return data

def format_query(tree, stack=('content_json',)) -> str:
    if isinstance(tree, dict):  # build stack of keys
        for key, child in tree.items():
            yield from format_query(child, stack + (key,))
    else:  # print complete stack, discarding leaf data in tree
        *keys, field = stack
        path = ' -> '.join(
            str(key) if isinstance(key, int) else "'%s'" % key
            for key in keys
        )
        yield path + " ->> '%s' as %s" % (field, field)

def create_select_query(lines_list):
    query = "select\n"
    for line_number in range(len(lines_list)):
        if "_class" in lines_list[line_number]:
            # ignore '_class' fields
            continue
        query += "\t" + lines_list[line_number]
        if line_number == len(lines_list)-1:
            query += "\n"
        else:
            query += ",\n"
    query += "from table_with_json"
    return query

I'm currently working on a JSON like this:

stack_nested_example = {"_class":"value_to_be_ignored","first_key":{"second_key":{"user_id":"123456","company_id":"9876","question":{"subject":"some_subject","case_type":"urgent","from_date":{"year":2011,"month":11,"day":11},"to_date":{"year":2012,"month":12,"day":12}},"third_key":[{"role":"driver","weather":"great"},{"role":"father","weather":"rainy"}]}}}

In the output I get the only constant element is the order of lines treated with array logic. Order of other lines differs. The output I would like to get is the one that takes into account order of the keys:

select
        'content_json' -> 'first_key' -> 'second_key' ->> 'user_id' as user_id,
        'content_json' -> 'first_key' -> 'second_key' ->> 'company_id' as company_id,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'subject' as subject,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'case_type' as case_type,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'year' as year,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'month' as month,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'day' as day,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'year' as year,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'month' as month,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'day' as day,
        'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'role' as role,
        'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'weather' as weather,
        'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'role' as role,
        'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'weather' as weather
from table_with_json
balkon16
  • 1,338
  • 4
  • 20
  • 40

2 Answers2

2

In your create_tree_from_JSON you never pass on the tree during recursion. Yet you try to return it.

def create_tree_from_JSON(json, parent=None):
    if not parent:
        tree = Tree()  # tree is only created for root node
        ...
    else:
        parent = parent  # tree is not created here
    ...
    return tree  # tree is always returned

Either pass on the tree during recursion, or separate the root step from the others:

def create_tree_from_JSON(json):  # root case
    tree = Tree()
    node_0 = TreeNode("ROOT")
    tree.root = node_0
    parent = node_0
    _walk_tree(json, parent)

def _walk_tree(json, parent):  # recursive case
    for key in json:
        if isinstance(json[key], dict):
            head = TreeNode(key)
            _walk_tree(json[key], head)
        else:
            node = TreeNode(key)
            node.add_child(TreeNode(json[key]))
        parent.add_child(node)

Note that what you are doing can be solved much easier using plain dicts. Your class is effectively just wrapping a custom interface around dict to begin with.

def treeify(data) -> dict:
    if isinstance(data, dict):  # already have keys, just recurse
       return {key: treeify(children) for key, children in data.items()}
    elif isinstance(data, list):  # make keys from indices
       return {idx: treeify(children) for idx, children in enumerate(data, start=1)}
    else:  # leave node, no recursion
       return data

You can feed any decoded json data to this.

>>> treeify(json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] })
{'name': 'John', 'items': {1: {'item_name': 'lettuce', 'price': 2.65, 'units': 'no'}, 2: {'item_name': 'ketchup', 'price': 1.51, 'units': 'litres'}}}

To get the desired pretty-printed output, you can walk this structure with a stack of current keys. A generator is appropriate to create each query line on the fly:

def format_query(tree, stack=('content_json',)) -> str:
    if isinstance(tree, dict):  # build stack of keys
        for key, child in tree.items():
            yield from format_query(child, stack + (key,))
    else:  # print complete stack, discarding leaf data in tree
       *keys, field = stack
       path = ' -> '.join(
           str(key) if isinstance(key, int) else "'%s'" % key
           for key in keys
       )
       yield path + " ->> '%s' as %s" % (field, field)

Given your second example, this allows you to get a list of query lines:

>>> list(format_query(treeify({ "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] })))
["'content_json' ->> 'name' as name",
 "'content_json' -> 'items' -> 1 ->> 'item_name' as item_name",
 "'content_json' -> 'items' -> 1 ->> 'price' as price",
 "'content_json' -> 'items' -> 1 ->> 'units' as units",
 "'content_json' -> 'items' -> 2 ->> 'item_name' as item_name",
 "'content_json' -> 'items' -> 2 ->> 'price' as price",
 "'content_json' -> 'items' -> 2 ->> 'units' as units"]
MisterMiyagi
  • 44,374
  • 10
  • 104
  • 119
  • Thank you for your input. Please note that I've changed the logic in the embedded case. The last item in the chain is the last key preceded by `->>`. – balkon16 May 02 '19 at 10:24
  • 1
    I get the `NameError: name 'keys' is not defined` while trying to replicate your answer – balkon16 May 06 '19 at 06:22
  • @balkon16 My bad, ``format_query`` should work now. – MisterMiyagi May 06 '19 at 07:34
  • Thanks for the edit. It turns out that `'content_json'` must be unquoted (`content_json`). I decided to go with a rather unelegant solution: `path = ' -> '.join(str(key) if isinstance(key, int) else "'%s'" % key for key in keys).replace("'content_json'", "content_json")` (added `replace(...` part. Do you happen to know a more elegant solution? – balkon16 May 06 '19 at 09:13
  • Another question: is a there a way to change the current implementation that uses `dict` into one that uses `collections.OrderedDict` so that the order of the fields in JSON file is preserved? – balkon16 May 06 '19 at 11:14
  • @balkon16 If you are using a sufficiently recent version of python, ``dict`` is insertion-ordered. Practically all current PyPy version should do, and CPython from 3.6 onwards as well. – MisterMiyagi May 06 '19 at 12:51
  • I'm aiming at my solution working for Python 3.5.2, which doesn't support insertion-ordered `dict`s – balkon16 May 06 '19 at 13:30
  • @balkon16 You can convert the dict comprehensions to generator expressions inside an ordered dict. Basically replace ``{a: b ...}`` by ``OrderedDict((a, b) ...)``. For example ``OrderedDict((key, treeify(children)) for key, children in data.items())`` for the first dict comprehension. – MisterMiyagi May 06 '19 at 13:40
  • even though I've added the `OrderedDict()` in both of the comprehensions but still end up with the solution (please see the edit from 07.05.2019) which doesn't keep the order of items – balkon16 May 07 '19 at 07:48
  • 1
    @balkon16 If you load your JSON to a ``dict``, you loose order there already. You must load the JSON using OrderedDict to represent objects with ordering: https://stackoverflow.com/questions/6921699/can-i-get-json-to-load-into-an-ordereddict – MisterMiyagi May 07 '19 at 08:59
1

You can use recursion:

def format_query(d):
  if all(not isinstance(i, tuple) for i in d):
    return 'select\n{}\nfrom table_with_json'.format(',\n'.join('\tcontent_json {}'.format("->> '{}' as {}".format(i[0], i[0]) if len(i) == 1 else "-> {} ->> '{}' as {}".format(' -> '.join("'{}'".format(j) for j in i[:-1]), i[-1], i[-1])) for i in d))
  return '\n\n'.join(format_query([c for b in i for c in b]) for i in d)

def get_dict(d, c = []):
  for a, b in d.items():
     if not isinstance(b, (dict, list)):
       yield c+[a]
     elif isinstance(b, dict):
       yield from to_query(b, c+[a])

def to_query(d, q = []):
  if not any(isinstance(i, list) for i in d.values()):
     yield from get_dict(d, c=q)
  else:
     _c = list(get_dict(d))
     for a, b in d.items():
       if isinstance(b, list):
         for i, j in enumerate(b, 1):
            yield (_c, list(get_dict(j, [a, i])))

Now, to format:

json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] }
print(format_query(list(to_query(json_file))))

Output:

select
      content_json ->> 'name' as name,
      content_json -> 'items' -> '1' ->> 'item_name' as item_name,
      content_json -> 'items' -> '1' ->> 'price' as price,
      content_json -> 'items' -> '1' ->> 'units' as units
from table_with_json

select
     content_json ->> 'name' as name,
     content_json -> 'items' -> '2' ->> 'item_name' as item_name,
     content_json -> 'items' -> '2' ->> 'price' as price,
     content_json -> 'items' -> '2' ->> 'units' as units
from table_with_json
Ajax1234
  • 69,937
  • 8
  • 61
  • 102
  • Thanks for the comment. Your answer works for printing out the tree but as I've pointed in the question I need to generate a tree and I'm using graphical (with tabs) representation to present output I'd like to obtain :) – balkon16 Apr 30 '19 at 19:58
  • @balkon16 Opps, thank you for pointing that out. What would your SQL query look like for the input with the list? – Ajax1234 Apr 30 '19 at 20:00
  • @balkon16 Thank you. In the inputs with no arrays, the values of the final key are included in the result i.e `content_json -> 'item_1' ->> 'end1' as end1,`. However, with the array, you only keep the key, `content_json ->> 'name' as name,`, and not provide the value, `content_json -> 'name' ->> 'John' as John`. Can you clarify this for me? Thank you. – Ajax1234 May 01 '19 at 15:51
  • this is my mistake. The `.sql` query for the embedded case (no arrays) follows the same logic as the array case. – balkon16 May 02 '19 at 10:23
  • @balkon16 Thanks for your patience :) Please see my recent edit. – Ajax1234 May 02 '19 at 18:12
  • I get `SyntaxError: invalid syntax` and the following location is indicated by the Python parser: `...'\tcontent_json {}'.format(f"->> '{i[0]}' as {i[0]}" ` (the last quotation mark) – balkon16 May 06 '19 at 07:06
  • @balkon16 That is the original code used `f-strings`, which are only available in Python3.7. Please try now, as I replaced the `f-strings` with general string formatting. – Ajax1234 May 06 '19 at 14:08
  • Hmmm that's weird. Now I get the `IndexError: tuple index out of range` – balkon16 May 07 '19 at 07:09
  • @balkon16 Strange. Please post the input that raises the error. – Ajax1234 May 07 '19 at 14:23
  • It's the "Now, to format:" part of your answer: `json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] }` and then `print(format_query(list(to_query(json_file))))` – balkon16 May 07 '19 at 16:59
  • @balkon16 Ah, the change from `f-string`s to `str.format` broke part of the formatting. It works for me now. Sorry for the delay! – Ajax1234 May 07 '19 at 17:08