3

I have the following tree represented in parent/child relationships:

import pandas as pd
df = pd.DataFrame(columns=['Parent','Child'])
df['Parent']=["A","A","A","B","B","B","C","C","F","G","G"]
df['Child']=["B","C","E","D","E","F","F","G","H","H","I"]

Some of the nodes have multiple parents. This should be removed by giving these common children different IDs based on the path. This is how it should look like after (right tree):

enter image description here

I wrote a funciton tha is supposed to write a path for every node and add it to the name, the results are collected in the dictionary "res". After a few day of trying this seems to be a bad approach as it doeas not split the paths. Below an example for node H.

Any ideas how the tree can be transformed?

res = {}
def find_parent(child, path):

    path.append(str(child))
    print("Path:    ", path)

    parents = df.loc[df['Child'] == child, ['Parent']]['Parent'].tolist()
    print("Parents: ",parents)

    if not parents:
        print("Path end reached!")
        print("Result: ", res)
        # i+1

    else :
        for i in range(0,len(parents)-1):
            if len(parents)>1: #dann neue paths
                path = [(str(child))]

                new_path = 'path_{}'.format(i)
                print("-->add ",parents[i])
                res[new_path] = str(''.join(path)) + parents[i]

                print("Result: ", res)
                print()
                find_parent(parents[i], path)

            else: 
                new_path = 'path_{}'.format(i)
                print("-->add ",parents[i])
                res[new_path] = str(''.join(path)) + parents[i]

                print("Result: ", res)
                print()

                find_parent(parents[0],path)

    return res

Example result for node "H"

find_parent("H", [])

{'path_0': 'FB'}

It should give H_FBA, HFCA and H_GCA.

Grapheneer
  • 897
  • 8
  • 25

1 Answers1

3

You can do this with a variety of typical graph transformations. Before diving into the code, we're working with a directed acyclic graph with one source node (that is, node with no incoming edges) in the original dataframe. These properties simplify matters quite a bit and guarantee that we can create a unique tree as shown in your desired output.

The flat dataframe isn't terribly easy to manipulate as a graph, so the first step I took was to turn it into an adjacency list.

Next, there are a few preparatory steps to perform:

  • Find the source node (although I used a generic graph function that handles more than one source, we're guaranteed there'll only be one source so we'll take the only item from the returned set). The lone source node will become the root of the new tree.
  • Run a function to extract all of the paths in the DAG starting from the source.
  • Create a dict that maps node names to incoming edge counts which helps guide the renaming procedure.
  • Create a renaming dictionary that lets us keep track of which nodes have been renamed.

Having built these structures, iterate over all possible paths in the graph to build the relabeled graph, formatting nodes with multiple incoming edges according to your spec (a reversed path node list) and storing the new names in the renaming dict.

Finally, sort a flattened version of this relabeled graph and dump it in the result dataframe.

Here's the code:

import pandas as pd
from collections import defaultdict

def find_source_nodes(graph):
    sources = set(graph)

    for neighbors in graph.values():
        sources = sources - neighbors

    return sources

def count_incoming_edges(graph):
    counts = defaultdict(int)

    for neighbors in graph.values():
        for neighbor in neighbors:
            counts[neighbor] += 1

    return counts

def find_all_paths_in_dag(graph, src, path=[]):
    path.append(src)

    if src in graph and graph[src]:
        for neighbor in graph[src]:
            yield from find_all_paths_in_dag(graph, neighbor, path)
    else:
        yield path[:]

    path.pop()

def flatten_adjacency_list(adj_list):
    flat_graph = []

    for parent, children in adj_list.items():
        flat_graph.extend([(parent, child) for child in children])

    return flat_graph

def relabel_dag(graph, root):
    relabeled_graph = defaultdict(set)
    all_paths = find_all_paths_in_dag(graph, root)
    incoming_edge_counts = count_incoming_edges(graph)
    renamed = {k: k for k in graph}

    for path in all_paths:
        for src, dst, i in zip(path, path[1:], range(len(path) - 1)):
            if incoming_edge_counts[dst] > 1:
                renamed[dst] = dst = f"{dst}_{''.join(path[:i+1][::-1])}"

            relabeled_graph[renamed[src]].add(dst)

    return relabeled_graph

if __name__ == "__main__":
    df = pd.DataFrame(columns=["Parent", "Child"])
    df["Parent"] = ["A", "A", "A", "B", "B", "B", "C", "C", "F", "G", "G"]
    df["Child"] = ["B", "C", "E", "D", "E", "F", "F", "G", "H", "H", "I"]
    graph = defaultdict(set)

    for parent, child in zip(df["Parent"], df["Child"]):
        graph[parent].add(child)

    root = next(iter(find_source_nodes(graph)))
    relabeled_tree = relabel_dag(graph, root)
    flat_relabeled_tree = sorted(flatten_adjacency_list(relabeled_tree))
    relabeled_df = pd.DataFrame(flat_relabeled_tree, columns=["Parent", "Child"])
    print(relabeled_df)

Output:

   Parent  Child
0       A      B
1       A      C
2       A    E_A
3       B      D
4       B   E_BA
5       B   F_BA
6       C   F_CA
7       C      G
8    F_BA  H_FBA
9    F_CA  H_FCA
10      G  H_GCA
11      G      I
ggorlen
  • 44,755
  • 7
  • 76
  • 106
  • Very sophisticated approach, thank you! What is if __name__ == "__main__":... doing? – Grapheneer May 09 '20 at 21:26
  • 1
    No problem. Thanks for the interesting question. https://stackoverflow.com/questions/419163/what-does-if-name-main-do – ggorlen May 09 '20 at 21:28