Let's say that we have the following JSON file. For the sake of the example it's emulated by a string. The string is the input and a Tree
object should be the output. I'll be using the graphical notation of a tree to present the output.
I've found the following classes to handle tree concept in Python:
class TreeNode(object):
def __init__(self, data):
self.data = data
self.children = []
def add_child(self, obj):
self.children.append(obj)
def __str__(self, level=0):
ret = "\t"*level+repr(self.data)+"\n"
for child in self.children:
ret += child.__str__(level+1)
return ret
def __repr__(self):
return '<tree node representation>'
class Tree:
def __init__(self):
self.root = TreeNode('ROOT')
def __str__(self):
return self.root.__str__()
The input file can be of different complexity:
Simple case
Input:
json_file = '{"item1": "end1", "item2": "end2"}'
Output:
"ROOT"
item1
end1
item2
end2
Embedded case
Input:
json_file = {"item1": "end1", "item2": {"item3": "end3"}}
Output:
"ROOT"
item1
end1
item2
item3
end3
Array case
Input:
json_file = { "name": "John", "items": [ { "item_name": "lettuce", "price": 2.65, "units": "no" }, { "item_name": "ketchup", "price": 1.51, "units": "litres" } ] }
Output:
"ROOT"
name
John
items
1
item_name
lettuce
price
2.65
units
no
2
item_name
ketchup
price
1.51
units
litres
Please note that each item in an array is described with an integer (starting at 1).
So far I've managed to come up with the following function that solves the problem for the simple case. In terms of the embedded case I know that I must use recursion but so far I get UnboundLocalError: local variable 'tree' referenced before assignment
.
def create_tree_from_JSON(json, parent=None):
if not parent:
tree = Tree()
node_0 = TreeNode("ROOT")
tree.root = node_0
parent = node_0
else:
parent = parent
for key in json:
if isinstance(json[key], dict):
head = TreeNode(key)
create_tree_from_JSON(json[key], head)
else:
node = TreeNode(key)
node.add_child(TreeNode(json[key]))
parent.add_child(node)
return tree
Problem's background
You may wonder why would I need to change a JSON object into a tree. As you may know PostgreSQL provides a way to handle JSON fields in the database. Given a JSON object I can get the value of any field by using ->
and ->>
notation. Here and here more about the subject. I will be creating new tables based on the fields' names and values. Unfortunately the JSON objects vary to such an extent that I cannot write the .sql
code manually - I must find a way to do it automatically.
Let's assume that I want to create a table based on the embedded case. I need to get the following .sql
code:
select
content_json ->> 'item1' as end1,
content_json -> 'item_2' ->> 'item_3' as end3
from table_with_json
Substitute content_json
for "ROOT"
and you can see that each line in SQL code is simply a depth-first traversal from "ROOT" to a leaf (move from the last node to leaf is always annotated with ->>
).
EDIT: In order to make the question more clear I'm adding the target .sql
query for the array case. I would like there to be as many queries as there are elements in the array:
select
content_json ->> 'name' as name,
content_json -> 'items' -> 1 -> 'item_name' as item_name,
content_json -> 'items' -> 1 -> 'price' as price,
content_json -> 'items' -> 1 -> 'units' as units
from table_with_json
select
content_json ->> 'name' as name,
content_json -> 'items' -> 2 ->> 'item_name' as item_name,
content_json -> 'items' -> 2 ->> 'price' as price,
content_json -> 'items' -> 2 ->> 'units' as units
from table_with_json
Solution so far (07.05.2019)
I'm testing the current solution for the moment:
from collections import OrderedDict
def treeify(data) -> dict:
if isinstance(data, dict): # already have keys, just recurse
return OrderedDict((key, treeify(children)) for key, children in data.items())
elif isinstance(data, list): # make keys from indices
return OrderedDict((idx, treeify(children)) for idx, children in enumerate(data, start=1))
else: # leave node, no recursion
return data
def format_query(tree, stack=('content_json',)) -> str:
if isinstance(tree, dict): # build stack of keys
for key, child in tree.items():
yield from format_query(child, stack + (key,))
else: # print complete stack, discarding leaf data in tree
*keys, field = stack
path = ' -> '.join(
str(key) if isinstance(key, int) else "'%s'" % key
for key in keys
)
yield path + " ->> '%s' as %s" % (field, field)
def create_select_query(lines_list):
query = "select\n"
for line_number in range(len(lines_list)):
if "_class" in lines_list[line_number]:
# ignore '_class' fields
continue
query += "\t" + lines_list[line_number]
if line_number == len(lines_list)-1:
query += "\n"
else:
query += ",\n"
query += "from table_with_json"
return query
I'm currently working on a JSON like this:
stack_nested_example = {"_class":"value_to_be_ignored","first_key":{"second_key":{"user_id":"123456","company_id":"9876","question":{"subject":"some_subject","case_type":"urgent","from_date":{"year":2011,"month":11,"day":11},"to_date":{"year":2012,"month":12,"day":12}},"third_key":[{"role":"driver","weather":"great"},{"role":"father","weather":"rainy"}]}}}
In the output I get the only constant element is the order of lines treated with array logic. Order of other lines differs. The output I would like to get is the one that takes into account order of the keys:
select
'content_json' -> 'first_key' -> 'second_key' ->> 'user_id' as user_id,
'content_json' -> 'first_key' -> 'second_key' ->> 'company_id' as company_id,
'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'subject' as subject,
'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'case_type' as case_type,
'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'year' as year,
'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'month' as month,
'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'day' as day,
'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'year' as year,
'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'month' as month,
'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'day' as day,
'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'role' as role,
'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'weather' as weather,
'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'role' as role,
'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'weather' as weather
from table_with_json