20

Basically, I want to iterate through a file and put the contents of each line into a deeply nested dict, the structure of which is defined by the amount of whitespace at the start of each line.

Essentially the aim is to take something like this:

a
    b
        c
    d
        e

And turn it into something like this:

{"a":{"b":"c","d":"e"}}

Or this:

apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10

into this:

{"apple":{"colours":["red","yellow","green"],"type":"granny smith","price":0.10}

So that I can send it to Python's JSON module and make some JSON.

At the moment I'm trying to make a dict and a list in steps like such:

  1. {"a":""} ["a"]
  2. {"a":"b"} ["a"]
  3. {"a":{"b":"c"}} ["a","b"]
  4. {"a":{"b":{"c":"d"}}}} ["a","b","c"]
  5. {"a":{"b":{"c":"d"},"e":""}} ["a","e"]
  6. {"a":{"b":{"c":"d"},"e":"f"}} ["a","e"]
  7. {"a":{"b":{"c":"d"},"e":{"f":"g"}}} ["a","e","f"]

etc.

The list acts like 'breadcrumbs' showing where I last put in a dict.

To do this I need a way to iterate through the list and generate something like dict["a"]["e"]["f"] to get at that last dict. I've had a look at the AutoVivification class that someone has made which looks very useful however I'm really unsure of:

  1. Whether I'm using the right data structure for this (I'm planning to send it to the JSON library to create a JSON object)
  2. How to use AutoVivification in this instance
  3. Whether there's a better way in general to approach this problem.

I came up with the following function but it doesn't work:

def get_nested(dict,array,i):
if i != None:
    i += 1
    if array[i] in dict:
        return get_nested(dict[array[i]],array)
    else:
        return dict
else:
    i = 0
    return get_nested(dict[array[i]],array)

Would appreciate help!

(The rest of my extremely incomplete code is here:)

#Import relevant libraries
import codecs
import sys

#Functions
def stripped(str):
    if tab_spaced:
        return str.lstrip('\t').rstrip('\n\r')
    else:
        return str.lstrip().rstrip('\n\r')

def current_ws():
    if whitespacing == 0 or not tab_spaced:
        return len(line) - len(line.lstrip())
    if tab_spaced:
        return len(line) - len(line.lstrip('\t\n\r'))

def get_nested(adict,anarray,i):
    if i != None:
        i += 1
        if anarray[i] in adict:
            return get_nested(adict[anarray[i]],anarray)
        else:
            return adict
    else:
        i = 0
        return get_nested(adict[anarray[i]],anarray)

#initialise variables
jsondict = {}
unclosed_tags = []
debug = []

vividfilename = 'simple.vivid'
# vividfilename = sys.argv[1]
if len(sys.argv)>2:
    jsfilename = sys.argv[2]
else:
    jsfilename = vividfilename.split('.')[0] + '.json'

whitespacing = 0
whitespace_array = [0,0]
tab_spaced = False

#open the file
with codecs.open(vividfilename,'rU', "utf-8-sig") as vividfile:
    for line in vividfile:
        #work out how many whitespaces at start
        whitespace_array.append(current_ws())

        #For first line with whitespace, work out the whitespacing (eg tab vs 4-space)
        if whitespacing == 0 and whitespace_array[-1] > 0:
            whitespacing = whitespace_array[-1]
            if line[0] == '\t':
                tab_spaced = True

        #strip out whitespace at start and end
        stripped_line = stripped(line)

        if whitespace_array[-1] == 0:
            jsondict[stripped_line] = ""
            unclosed_tags.append(stripped_line)

        if whitespace_array[-2] < whitespace_array[-1]:
            oldnested = get_nested(jsondict,whitespace_array,None)
            print oldnested
            # jsondict.pop(unclosed_tags[-1])
            # jsondict[unclosed_tags[-1]]={stripped_line:""}
            # unclosed_tags.append(stripped_line)

        print jsondict
        print unclosed_tags

print jsondict
print unclosed_tags
Tomcat
  • 361
  • 1
  • 2
  • 11
  • 4
    I have to quote the [Zen of Python](http://www.python.org/dev/peps/pep-0020/) "Flat is better than nested." I would change how you are doing this. There is always a better way than nesting dictionaries. Also, make sure you aren't falling into an [X Y Problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – Inbar Rose Jul 25 '13 at 12:47
  • My original way of doing it was pretty simply to generate a big long string using various rules. Would that be superior? – Tomcat Jul 25 '13 at 12:50
  • 1
    That depends on what you are trying to achieve, take a look at the [X Y Problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) and make sure you are not making a similar mistake. Essentially you need to figure out what you DATA is, and build your container around that, not build a container and figure out how to put your DATA into it. Each type of container has its advantages, but using a string to store different data sets is never a good idea. – Inbar Rose Jul 25 '13 at 12:52
  • 1
    As @InbarRose said by "XY Problem", I think you should explain what you are trying to do. I really don't understand what you mean by "put the contents of each line into a deeply nested dict, the structure of which is defined by the amount of whitespace at the start of each line." – hivert Jul 25 '13 at 12:53
  • Are you trying to build some kind of http://en.wikipedia.org/wiki/Trie using dicts ? – hivert Jul 25 '13 at 12:56
  • @hivert edited question; basically yes, a tree would be perfect but the JSON python thingo seems to want a nested dict! – Tomcat Jul 25 '13 at 12:57
  • 1
    Sorry. Still not clear. Can some letter appear twice in the same columns ? We need more examples or a precise specification. – hivert Jul 25 '13 at 12:59
  • @hivert added another example, hope that helps? – Tomcat Jul 25 '13 at 13:07
  • May I suggest changing the post's title to something like "Creating a tree from an indented text file". The recursive part in particular is, I believe, already an assumption about the solution that may not even be necessary. – Nicolas78 Jul 25 '13 at 13:37
  • @Nicolas78 yeah good idea. have changed it, and altered categories – Tomcat Jul 25 '13 at 14:08
  • @InbarRose: "There is always a better way than nesting dictionaries." Well, that's your opinion, and it's wrong. [Trie](https://en.wikipedia.org/wiki/Trie) are basically extremely efficient nested dicts. You won't achieve the same efficiency with flat lists or dicts. – Eric Duminil Jun 01 '21 at 09:19

4 Answers4

14

Here is an object oriented approach based on a composite structure of nested Node objects.

Input:

indented_text = \
"""
apple
    colours
        red
        yellow
        green
    type
        granny smith
    price
        0.10
"""

a Node class

class Node:
    def __init__(self, indented_line):
        self.children = []
        self.level = len(indented_line) - len(indented_line.lstrip())
        self.text = indented_line.strip()

    def add_children(self, nodes):
        childlevel = nodes[0].level
        while nodes:
            node = nodes.pop(0)
            if node.level == childlevel: # add node as a child
                self.children.append(node)
            elif node.level > childlevel: # add nodes as grandchildren of the last child
                nodes.insert(0,node)
                self.children[-1].add_children(nodes)
            elif node.level <= self.level: # this node is a sibling, no more children
                nodes.insert(0,node)
                return

    def as_dict(self):
        if len(self.children) > 1:
            return {self.text: [node.as_dict() for node in self.children]}
        elif len(self.children) == 1:
            return {self.text: self.children[0].as_dict()}
        else:
            return self.text

To parse the text, first create a root node. Then, remove empty lines from the text, and create a Node instance for every line, pass this to the add_children method of the root node.

root = Node('root')
root.add_children([Node(line) for line in indented_text.splitlines() if line.strip()])
d = root.as_dict()['root']
print(d)

result:

{'apple': [
  {'colours': ['red', 'yellow', 'green']},
  {'type': 'granny smith'},
  {'price': '0.10'}]
}

I think that it should be possible to do it in one step, where you simply call the constructor of Node once, with the indented text as an argument.

jkokorian
  • 2,905
  • 7
  • 32
  • 47
8

Here is a recursive solution. First, transform the input in the following way.

Input:

person:
    address:
        street1: 123 Bar St
        street2: 
        city: Madison
        state: WI
        zip: 55555
    web:
        email: boo@baz.com

First-step output:

[{'name':'person','value':'','level':0},
 {'name':'address','value':'','level':1},
 {'name':'street1','value':'123 Bar St','level':2},
 {'name':'street2','value':'','level':2},
 {'name':'city','value':'Madison','level':2},
 {'name':'state','value':'WI','level':2},
 {'name':'zip','value':55555,'level':2},
 {'name':'web','value':'','level':1},
 {'name':'email','value':'boo@baz.com','level':2}]

This is easy to accomplish with split(':') and by counting the number of leading tabs:

def tab_level(astr):
    """Count number of leading tabs in a string
    """
    return len(astr)- len(astr.lstrip('\t'))

Then feed the first-step output into the following function:

def ttree_to_json(ttree,level=0):
    result = {}
    for i in range(0,len(ttree)):
        cn = ttree[i]
        try:
            nn  = ttree[i+1]
        except:
            nn = {'level':-1}

        # Edge cases
        if cn['level']>level:
            continue
        if cn['level']<level:
            return result

        # Recursion
        if nn['level']==level:
            dict_insert_or_append(result,cn['name'],cn['value'])
        elif nn['level']>level:
            rr = ttree_to_json(ttree[i+1:], level=nn['level'])
            dict_insert_or_append(result,cn['name'],rr)
        else:
            dict_insert_or_append(result,cn['name'],cn['value'])
            return result
    return result

where:

def dict_insert_or_append(adict,key,val):
    """Insert a value in dict at key if one does not exist
    Otherwise, convert value to list and append
    """
    if key in adict:
        if type(adict[key]) != list:
            adict[key] = [adict[key]]
        adict[key].append(val)
    else:
        adict[key] = val
kalu
  • 2,594
  • 1
  • 21
  • 22
  • 2
    Could you provide the codes to translate input to `First-step output`? Thanks. – Ursa Major Jan 29 '16 at 01:01
  • In case anyone is interested, I've created a similar [C# implementation](http://stackoverflow.com/a/36998605/107625). – Uwe Keim Jun 22 '16 at 12:48
  • Here's a [highly related question](http://stackoverflow.com/questions/38664465/creating-a-tree-deeply-nested-dict-with-lists-from-an-indented-text-file). Any chance you can help? – zelusp Jul 29 '16 at 17:34
3

The following code will take a block-indented file and convert into an XML tree; this:

foo
bar
baz
  ban
  bal

...becomes:

<cmd>foo</cmd>
<cmd>bar</cmd>
<block>
  <name>baz</name>
  <cmd>ban</cmd>
  <cmd>bal</cmd>
</block>

The basic technique is:

  1. Set indent to 0
  2. For each line, get the indent
  3. If > current, step down and save current block/ident on a stack
  4. If == current, append to current block
  5. If < current, pop from the stack until you get to the matching indent

So:

from lxml import builder
C = builder.ElementMaker()

def indent(line):
    strip = line.lstrip()
    return len(line) - len(strip), strip

def parse_blockcfg(data):
    top = current_block = C.config()
    stack = []
    current_indent = 0

    lines = data.split('\n')
    while lines:
        line = lines.pop(0)
        i, line = indent(line)

        if i==current_indent:
            pass

        elif i > current_indent:
            # we've gone down a level, convert the <cmd> to a block
            # and then save the current ident and block to the stack
            prev.tag = 'block'
            prev.append(C.name(prev.text))
            prev.text = None
            stack.insert(0, (current_indent, current_block))
            current_indent = i
            current_block = prev

        elif i < current_indent:
            # we've gone up one or more levels, pop the stack
            # until we find out which level and return to it
            found = False
            while stack:
                parent_indent, parent_block = stack.pop(0)
                if parent_indent==i:
                    found = True
                    break
            if not found:
                raise Exception('indent not found in parent stack')
            current_indent = i
            current_block = parent_block

        prev = C.cmd(line)
        current_block.append(prev)

    return top
Realist
  • 509
  • 2
  • 13
0

First of all, don't use array and dict as variable names because they're reserved words in Python and reusing them may end up in all sorts of chaos.

OK so if I get you correctly, you have a tree given in a text file, with parenthood indicated by indentations, and you want to recover the actual tree structure. Right?

Does the following look like a valid outline? Because I have trouble putting your current code into context.

result = {}
last_indentation = 0
for l in f.xreadlines():
   (c, i) = parse(l) # create parse to return character and indentation
   if i==last_indentation:
   # sibling to last
   elif i>last_indentation:
   # child to last
   else:
   # end of children, back to a higher level

OK then your list are the current parents, that's in fact right - but I'd keep them pointed to the dictionary you've created, not the literal letter

just starting some stuff here

result = {}
parents = {}
last_indentation = 1 # start with 1 so 0 is the root of tree
parents[0] = result
for l in f.xreadlines():
   (c, i) = parse(l) # create parse to return character and indentation
   if i==last_indentation:
       new_el = {}
       parents[i-1][c] = new_el
       parents[i] = new_el
   elif i>last_indentation:
   # child to last
   else:
   # end of children, back to a higher level
Nicolas78
  • 5,124
  • 1
  • 23
  • 41
  • thanks! if json.dumps took a format that wasn't a dict i'd be happier :P – Tomcat Jul 25 '13 at 13:11
  • ok so my advice would be to factor out the file reading and parsing stuff, and replace the main loop with something similar to the above. in particular, this means you won't need recursion. – Nicolas78 Jul 25 '13 at 14:15