0

I've seen that there are a fair few questions addressing more or less this issue, but I've not managed to apply them to my specific use-case, and I've been scratching my head and trying different solutions for a couple of days now.

I have a list of dictionaries, with their hierarchical position encoded as a string of index numbers - I want to rearrange the dictionaries into a nested hierarchy using these indices.

Here's some example data:

my_data = [{'id':1, 'text':'one', 'path':'1'},
           {'id':2, 'text':'two', 'path':'3.1'},
           {'id':3, 'text':'three', 'path':'2.1.1'},
           {'id':4, 'text':'four', 'path':'3.2.1'},
           {'id':5, 'text':'five', 'path':'2.1.2'},
           {'id':6, 'text':'six', 'path':'3.2.2'},
           {'id':7, 'text':'seven', 'path':'2'},
           {'id':8, 'text':'eight', 'path':'3'},
           {'id':9, 'text':'nine', 'path':'3.2'},
           {'id':10, 'text':'ten', 'path':'2.1'}]

and here's what I'm trying to achieve:

result = {1:{'id':1, 'text':'one', 'path':'1'},
          2:{'id':7, 'text':'seven', 'path':'2', 'children':{
              1:{'id':10, 'text':'ten', 'path':'2.1', 'children':{
                  1:{'id':3, 'text':'three', 'path':'2.1.1'},
                  2:{'id':5, 'text':'five', 'path':'2.1.2'}
                  }}}},
          3:{'id':8, 'text':'eight', 'path':'3', 'children':{
              1:{'id':2, 'text':'two', 'path':'3.1'},
              2:{'id':9, 'text':'nine', 'path':'3.2', 'children':{
                  1:{'id':4, 'text':'four', 'path':'3.2.1'},
                  2:{'id':6, 'text':'six', 'path':'3.2.2'}
                  }}}}
          }

Since the paths of the individual data dictionaries don't appear in any logical order, I'm using dictionaries throughout rather than lists of dictionaries, as this allows me to create 'empty' spaces in the structure. I don't really want to rely on re-ordering the dictionaries in the initial list.

Here's my code:

#%%
class my_dict(dict):
    def rec_update(self, index, dictObj): # extend the dict class with recursive update function
        """
                Parameters
        ----------
        index : list
            path to dictObj.
        dictObj : dict
            data object.

        Returns: updates the dictionary instance
        -------
        None.

        """  
        pos = index[0]
        index.pop(0)
        if len(index) != 0:
            self.update({pos : {'children' : {self.rec_update(index, dictObj)}}})
        else:
            self.update({pos : dictObj})

#%%
dataOut = my_dict() #create empty dictionary to receive result
dataOut.clear()

# dictObj = my_data[0] # for testing
# dictObj = my_data[1]

for dictObj in my_data:
    index = dictObj.get('path').split(".") # create the path list
    dataOut.rec_update(index, dictObj) # place the current data dictionary in the hierarchy

The issue with the code is that the result of the nested function call in the class definition self.rec_update(index, dictObj) isn't ending up as the value of the 'children' key. Is this because I've not understood the scope of self properly?

I've noticed during testing that, if I run the dataOut.rec_update(index, dictObj) call for a single element of my_data, e.g. dictObj = my_data[1], that the index list variable in the console scope is modified, which is unexpected, as I thought the rec_update() function had its own distinct scope.

I think I can see a further bug where the 'children' element will be overwritten, but I'm not at that stage yet.

I'd welcome any explanation that can put me on the right track, please.

Maicon Mauricio
  • 2,052
  • 1
  • 13
  • 29
  • I'm not sure if you missed it or not, but in case you did, my solution does *not* change the source list, or its dictionaries, in any way. Any objects that it modifies are objects that it created. But as I said, the approach in your post isn't the best way to do this, and it lacks some of the properties that I consider mandatory, which is why I put together a simpler solution from scratch. – Tom Karzes Jul 11 '21 at 00:57
  • @TomKarzes: I saw that the source remains untouched, but my spec was: _not to re-order it_. I actually want to understand why the recursive approach I'm testing doesn't work, hence my questions about scope (and observation regarding the index list). I'm not asking for a 'solution' but I'm looking to understand python better. I've simplified the code before posting here, as it does a bunch of other stuff (including converting the index from strings to integers), but I've deliberately left that out, to get at the core issues regarding __recursion__, __scope__, __extended classes__, and __self__. – jack_sprat Jul 11 '21 at 09:14
  • Regarding the "no reordering" condition: Python 3.7 was the first version of Python that guaranteed dictionary order. Prior to that, there were no guarantees so the concept of dictionary order didn't even exist. Looking at the example, I see that, within a given sub-level, the nodes are already in path order. But if, for example, 2.1.2 came *before* 2.1.1 in the original data, would you want that reflected in the (ordered) dict? It could certainly be done, but it means not sorting the data. It would just need to ensure that parent nodes are created before their children. – Tom Karzes Jul 11 '21 at 09:42
  • If you're interested, you can read about the dictionary ordering guarantee [here](https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6). – Tom Karzes Jul 11 '21 at 09:46

2 Answers2

0

Here's a solution that you should be able to adapt to your needs. It's just a stand-alone function that transforms my_data into result:

def make_tree(data):
    ###
    ### Construct path_list and path_dict
    ###

    path_dict = {}
    path_list = []

    for data in data:
        path = data['path']
        path_split = path.split('.')
        assert len(path_split) >= 1

        path_tuple = tuple(map(int, path_split))
        assert path_tuple not in path_dict
        path_dict[path_tuple] = data
        path_list.append(path_tuple)

    ###
    ### Sort path_list.  This is sorting the tuples corresponding to
    ### each path value.  Among other things, this ensues that the
    ### parent of a path appears before the path.
    ###

    path_list.sort()

    ###
    ### Construct and return the tree
    ###

    new_path_dict = {}
    tree = {}

    for path_tuple in path_list:
        data = path_dict[path_tuple]
        path_leaf = path_tuple[-1]

        new_data = data.copy()

        if len(path_tuple) == 1:
            assert path_leaf not in tree
            tree[path_leaf] = new_data
        else:
            parent_path_tuple = path_tuple[:-1]
            assert parent_path_tuple in new_path_dict
            parent = new_path_dict[parent_path_tuple]

            if 'children' not in parent:
                children = {}
                parent['children'] = children
            else:
                children = parent['children']

            assert path_leaf not in children
            children[path_leaf] = new_data

        new_path_dict[path_tuple] = new_data

    return tree

When called as:

result = make_tree(my_data)

It gives result the value:

{1: {'id': 1, 'text': 'one', 'path': '1'},
 2: {'id': 7, 'text': 'seven', 'path': '2', 'children': {
     1: {'id': 10, 'text': 'ten', 'path': '2.1', 'children': {
         1: {'id': 3, 'text': 'three', 'path': '2.1.1'},
         2: {'id': 5, 'text': 'five', 'path': '2.1.2'}}}}},
 3: {'id': 8, 'text': 'eight', 'path': '3', 'children': {
     1: {'id': 2, 'text': 'two', 'path': '3.1'},
     2: {'id': 9, 'text': 'nine', 'path': '3.2', 'children': {
         1: {'id': 4, 'text': 'four', 'path': '3.2.1'},
         2: {'id': 6, 'text': 'six', 'path': '3.2.2'}}}}}}

Note that Python 3 dictionaries maintain the order of added elements, so in that sense, the constructed tree is "sorted" at each level by the corresponding path component.

Also note that the original source list, and the dictionaries it contains, are unchanged by this function.

Tom Karzes
  • 22,815
  • 2
  • 22
  • 41
  • wow - yes that looks like it produces the result that I'm looking for, but... I'm trying to get a feel for why my (more compact) approach is failing. (sidenote: I'm attempting to learn stuff about python here, hence the questions about scope, and *self*, and deliberately extending a class, and not wanting to re-order the original list). There's definitely lots to digest in your answer, so thanks very much. – jack_sprat Jul 11 '21 at 00:31
  • @jack_sprat I tried to keep it as simple as possible, and I put a bunch of assertions in so that you'd know right away if any assumptions were violated by the data. The two-pass approach solves two problems: (1) it ensures that parent nodes are always added before child nodes, and (2) it ensures that the nodes are sorted by their path component at each level. Note that I did not need to use recursion for this solution. – Tom Karzes Jul 11 '21 at 00:34
  • @jack_sprat Note that some things are missing from your solution. For instance, I don't think it's ever converting the path components to integers, which you don't want. Among other things, it leaving them as strings would mean that `'10'` comes before `'2'`, since `'10' < '2'`, even though `2 < 10`. I fixed all this in my solution. – Tom Karzes Jul 11 '21 at 00:39
  • @jack_sprat Anyway, I think it's fairly minimal, given that it has the desired assertions and sorting behavior. You could compress it a bit by using `defaultdict` etc., but as I said, I wanted to keep it basic and simple. – Tom Karzes Jul 11 '21 at 00:42
  • @jack_sprat Also note that this does *not* change the original list or data dicts in any way. Any changes are done to copies or temporary objects that are used to produce the tree. You mentioned that you did not want to reorder the original list. This does not reorder it. – Tom Karzes Jul 11 '21 at 00:43
  • @jack_sprat Anyway, I believe this solution satisfies all of the stated requirements, and several that were not stated. – Tom Karzes Jul 11 '21 at 01:14
0

I think I've cracked it! And I've learned a lot in the process. (I'd hope so - I've got at least 24 SO tabs open, 6 doc.python.org tabs, and maybe 20 others - so it's been a group effort!)

Here is a recursive function that creates the required nested data:

class my_dict(dict):                                                    # new class inherits dict()
    def rec_update(self, index, dictObj): 
        pos = index[0]                                                  # store the first index position
        index.pop(0)                                                    # remove the first position from the list
        dictTemp = my_dict()                                            # will be used to pass the nested branch to the recursive function - doesn't need defined here
        if len(index) != 0:                                             # ... then we've not arrived at the leaf yet
            if not (pos in self and 'children' in self[pos]):           # ... then create a new branch
                self[pos] = {'children': {}}                            # create template
            dictTemp = my_dict(self[pos]['children'])                   # cast the dictionary as my_dict so that it has access to the rec_update() function
            self[pos]['children'] = dictTemp.rec_update(index, dictObj) # pass data on to next level, and recurse
        else:
            if (pos in self and 'children' in self[pos]):               # ... then update existing branch
                self[pos].update(dictObj)                               # add in the data alongside pre-existing children key
            else:                                                       # populate new branch with data, finally!
                self[pos] = dictObj
        return self

and here is the calling code:

dataOut = my_dict()

for dictObj in my_data:
    index = [int(i) for i in dictObj.get('path').split(".")] # turn path string into list and iterate; convert to integers
    dataOut.rec_update(index, dictObj)

I still don't understand why changes to index inside the function alter index in the calling code - answers welcome :-)

But I did discover that I couldn't override dict.copy() with a __copy__() function inside my my_dict class definition, hence dictTemp = my_dict(self[pos]['children']) rather than dictTemp = self[pos]['children'].copy().

One final oddity which I've still to address: when I apply it to my production data, I have to run it twice!