3

I have tree-like data that is constructed of Parent codes that contain Child codes which may act as parents themselves, depending on if they're marked as "SA". This data is present in an Excel sheet and looks like follows:

| Tree Level (A) | Code (B) | Spec (C) | Comm. Code (D) | Parent Code (J) |
|----------------|----------|----------|----------------|-----------------|
|              1 | A12      |        1 | SA             | Mach            |
|              2 | B41      |        2 | SA             | A12             |
|              3 | A523     |        1 | BP             | B41             |
|              2 | G32      |        4 | BP             | A12             |
|              2 | D3F5     |        1 | SA             | A12             |
|              3 | A12      |        4 | SA             | D3F5            |
|              3 | A12      |        1 | SA             | D3F5            |

There is one issue here: A12, at the top tree level (1), contains a child (D3F5), which itself contains another parent that's the same as D3F5's own parent. As you may be able to imagine, this (although not represented in the data as it is delivered to me) creates an endless loop, where A12 at tree level 3 unfolds the entire structure again and again.

Note that one of the two 'A12' children poses no problem, as this has a different specification as to the A12 parent at tree level 1.

I have a function that checks for this situation, but it is extremely slow as it uses nested loops to go through the rows, and the total row count can be several 1000s. The end goal is to show the user the deepest level at which the error occurs. In this example, that would be code A12 with spec 1 at tree level 3:

def nested_parent(sht):
    """
    Checks if a parent SA contains itself as a child.
    :return: nested_parents: Dictionary of found 'nested parents'. None if none found
    """
    nested_parents = {}
    found = False

    lrow = sht.Cells(sht.Rows.Count, 1).End(3).Row
    parent_treelevel = 1

    # Get deepest tree level, as this no longer contains children
    last_treelevel = int(max([i[0] for i in sht.Range(sht.Cells(2, 1), sht.Cells(lrow, 1)).Value]))

    # Loop through parent rows
    print('Checking for nested parents...')
    for i in range(2, lrow):
        if sht.Cells(i, "D").Value == "SA":
            parent_code, parent_treelevel = f'{sht.Cells(i, "B").Value}_{sht.Cells(i, "C")}', sht.Cells(i, "A").Value

            # Add new key with list containing parent's tree level for parent code
            if parent_code not in nested_parents:
                nested_parents[parent_code] = [int(parent_treelevel)]

            # Loop child rows
            for j in range(i + 1, lrow + 1):
                child_code, child_treelevel = f'{sht.Cells(j, "B").Value}_{sht.Cells(j, "C")}', sht.Cells(i, "A").Value

                if child_code == parent_code and child_treelevel > parent_treelevel:
                    found = True
                    nested_parents[parent_code].append(int(child_treelevel))

        if parent_treelevel == last_treelevel:
            # End function if deepst tree level is reached
            print("done")
            if found:
                # Delete keys that contain no information
                delkeys = []
                for key in reversed(nested_parents):
                    if len(nested_parents[key]) == 1:
                        delkeys.append(key)
                for key in delkeys:
                    del nested_parents[key]
                return nested_parents
            else:
                return

This function can be called as follows, where wb_name is the name of the workbook that contains the data:

from win32com.client import GetObject
wb_name = "NAME"
sht = GetObject(None, "Excel.Application").Workbooks(wb_name).Worksheets(1)


def err(msg):
    """
    stops the code from executing after printing an error message
    """
    print("Unexpected error occured:", msg)
    exit()


infloop = nested_parent(sht)
if infloop is not None:
    dict_str = ''.join([f'Code: {key}, Tree levels: {infloop[key]}\n' for key in infloop])
    err(f"Warning: one or more parent codes contain their own code as a child:\n{dict_str}")

I am hoping to speed up this code, as the rest of my script is fairly quick and its speed is being seriously hampered by this function.

Tim Stack
  • 3,209
  • 3
  • 18
  • 39
  • Are loops valid in the dataset? Or are you catching this in order to raise an error back to the user? – a'r May 28 '20 at 10:11
  • Well constructed question, but since it's actually working it may possibly be better at place over at [CodeReview](https://codereview.stackexchange.com/)? I'll leave that up to you. – JvdV May 28 '20 at 10:11
  • @a'r if I understand your question correctly, endless loops such as the one explained in my question should not be there and therefore the error ought to be raised – Tim Stack May 28 '20 at 10:40
  • @JvdV thanks. I considered that, but seeing how this code is theoretically working but would be useless in a practical sense, I decided to post it here in SO instead – Tim Stack May 28 '20 at 10:41
  • I'd suggest building a directed graph and then look for cycles . See: https://stackoverflow.com/questions/261573/best-algorithm-for-detecting-cycles-in-a-directed-graph – a'r May 28 '20 at 15:49
  • I am not familiar with directed graphs, @a'r. If you could construct an answer from your comment, it would help tremendously – Tim Stack May 29 '20 at 10:06
  • Have you considered using xml or json for your data? They are designed to handle hierarchical dat such as parent/child relationships. – Matt L. Jun 05 '20 at 18:49
  • I have no experience with either @MattL., so do elaborate – Tim Stack Jun 05 '20 at 18:51
  • @TimStack, I can put together a little example, but it would help know what your ultimate end goal is. It appears from the code that you want to print out each code with a report stating which levels it appears. Is that right? – Matt L. Jun 05 '20 at 18:57
  • Correct. The code should print a message that shows the Code and Tree Level's that have the issue. If it helps, you can try to run the code on some mockup Excel sheet. – Tim Stack Jun 05 '20 at 19:02
  • @Tim if you want to stick with your current approach, switching to Variant Arrays will speed things up enormously – chris neilsen Jun 05 '20 at 20:36
  • @chrisneilsen make it an answer and we'll see! – Tim Stack Jun 10 '20 at 09:51

2 Answers2

3

As @a'r mentioned, your "tree like data" can be seen as a directed graph, i.e. points (nodes) connected with arrows (directed edges). There is a very powerful library called networkx that is dealing with graphs very nicely. Without getting too deep into graph theory, consider the following code example:

import networkx as nx

edges = [ ('A12', 'Mach'), 
          ('B41', 'A12'),
          ('A523','B41'),
          ('G32', 'A12'),
          ('D3F5','A12'),
          ('A12', 'D3F5'),
          ('A12', 'D3F5') ]

G = nx.DiGraph(edges)
cycles_list = list(nx.simple_cycles(G))
print(cycles_list)

Output:

[['A12', 'D3F5']]

Here nodes names are codes themselves as you read them, and edges are the connections between a child and a parent. You can create list of edges easily by just taking the corresponding columns of your Excel file. The exact direction (a parent to a child or vice versa) in this case is not very important, just remain consistent.

simple_cycles returns a generator. Here you can find the documentation on it.

update

Once you've got your list of loops, to find deepest node, you need to match the node and find it's deepest appearance.

Create a list of your nodes from columns A, B and J. It will look like:

data = [
   [1, 'A12', 'Mach'],
   [2, 'B41', 'A12'],
   [3, 'A523', 'B41'],
   [2, 'G32', 'A12'],
   [2, 'D3F5', 'A12'],
   [3, 'A12', 'D3F5'],
   [3, 'A12', 'D3F5'] ]

result = {}
for entry in data:
    for el in cycles_list:
        if entry[1:] == el:
            key = tuple(el)
            result[key] = max(result.setdefault(key, 0), entry[0])
print(result)

>>>
{('A12', 'D3F5'): 3}

Now you will get a dictionary where key is the problematic node and value is the deepest level it can be found at.

igrinis
  • 12,398
  • 20
  • 45
  • Thank you very much for this answer, as it is very easy to use and works well. A downside I found is the ordering of the output is seemingly random. This poses a problem, as I have to show the user the deepest node in my structure where the error occurs. How would I tackle this? – Tim Stack Jun 08 '20 at 16:32
  • What do you mean "deepest"? And why only single one? I thought you need to remove all the loops. – igrinis Jun 08 '20 at 20:14
  • The Excel files I get are merely output from a database I don't have writing rights to. The goal is to alert the user about issues such as these so they can resolve them in the database. With 'deepest', I mean the lowest tree level where the issue occurs. In my example that is code A12 with spec 1 at tree level 3. It could be however that this code is encountered a few levels deeper, though, and then *that* is what must be marked. Please run the code I supplied with my post to see what the output is really like. – Tim Stack Jun 10 '20 at 07:04
0

I hope this response will help demonstrate the power of a hierarchical data structure. What I have done is rewrite the data as a json string and then written code to walk through the hierarchy and generate the report. You would still have the task of converting excel to json. The main point is that each level of the json has the same keys, and that each child in children has the same keys as its parent dictionary, so that enables a recursive function to traverse the structure. I did examples to total by codes or levels.

import json 
json_data = """
{
    "level": 0,
    "code": "Mach",
    "children": [
        {
            "level": 1,
            "code": "A12",
            "children": [
                {
                    "level": 2,
                    "code": "B41",
                    "children": [
                        {
                            "level": 3,
                            "code": "A523",
                            "children": []
                        }
                    ]
                },
                {
                    "level": 2,
                    "code": "G32",
                    "children": []
                },
                {
                    "level": 2,
                    "code": "D3F5",
                    "children": [
                        {
                            "level": 3,
                            "code": "A12",
                            "children": []
                        },
                        {
                            "level": 3,
                            "code": "A12",
                            "children": []
                        }
                    ]
                }
            ]
        }
    ]
}
"""

data = json.loads(json_data)

def crawl_levels(mydict, result={}):
    try:
        result[mydict["level"]].append(mydict["code"])
    except:
        result[mydict["level"]] = [mydict["code"],]

    for i in mydict["children"]:
        result = crawl_levels(i, result=result)
    return result

crawl_levels(data) 
>>>{0: ['Mach'], 1: ['A12'], 2: ['B41', 'G32', 'D3F5'], 3: ['A523', 'A12', 'A12']}

def crawl_codes(mydict, result={}):
    try:
        result[mydict["code"]].append(mydict["level"])
    except:
        result[mydict["code"]] = [mydict["level"],]

    for i in mydict["children"]:
        result = crawl_codes(i, result=result)
    return result

crawl_codes(data) 
>>>{'Mach': [0],
 'A12': [1, 3, 3],
 'B41': [2],
 'A523': [3],
 'G32': [2],
 'D3F5': [2]}
Matt L.
  • 3,431
  • 1
  • 15
  • 28