
I have a dataframe with four columns: parent_serialno, child_serialno, parent_function, and child_function. I would like to construct a dataframe where each row is a root parent and each column is a function with the values being the serial number for that function.

For example, the dataframe looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['001', '010', 'A', 'B'], ['001', '020', 'A', 'C'], ['010', '100', 'B', 'D'], ['100', '110', 'D', 'E'],
     ['002', '030', 'A', 'B'], ['002', '040', 'A', 'C']],
    columns=['parent_serialno', 'child_serialno', 'parent_function', 'child_function'])

Note that not every root has a descendant for every function, but each root has at most one serial number per function. The root serial numbers are known ahead of time.

The desired output is a dataframe like this:


pd.DataFrame([['001','010','020','100','110'],['002','030','040', np.nan, np.nan]], columns = ['A','B','C','D','E'])

Out[1]: 
     A    B    C    D    E
0  001  010  020  100  110
1  002  030  040  NaN  NaN

This post shows how to get a dictionary hierarchy, but I'm less concerned about identifying the location of a leaf in the tree (i.e. grandchild vs great-grandchild) and more concerned with just identifying the root and function of each leaf.

Dani Mesejo
Kyle

1 Answer


Use networkx to model the parent/child relationships as a directed graph:

# Python env: pip install networkx
# Anaconda env: conda install networkx
import networkx as nx
import pandas as pd

# Combine function and serial number into (function, serialno) node tuples
df['parent'] = df[['parent_function', 'parent_serialno']].apply(tuple, axis=1)
df['child'] = df[['child_function', 'child_serialno']].apply(tuple, axis=1)

# Create a directed graph from dataframe
G = nx.from_pandas_edgelist(df, source='parent', target='child', 
                            create_using=nx.DiGraph)

# Find roots and leaves
roots = [node for node, degree in G.in_degree() if degree == 0]
leaves = [node for node, degree in G.out_degree() if degree == 0]

# Find all paths from each root to each leaf
paths = {}
for root in roots:
    children = paths.setdefault(root, [])
    for leaf in leaves:
        for path in nx.all_simple_paths(G, root, leaf):
            children.extend(path[1:])
    children.sort(key=lambda x: x[1])

# Create your final output
out = pd.DataFrame([dict([parent] + children) for parent, children in paths.items()])

Output:

>>> out
     A    B    C    D    E
0  001  010  020  100  110
1  002  030  040  NaN  NaN
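
Since only the root and the function of each descendant matter, not the position in the tree, the path enumeration above can probably be skipped. The following is a minimal sketch of that idea using `nx.descendants`, assuming (as the question states) that each root reaches at most one serial number per function. The names `rows` and `out_alt` are illustrative and do not appear in the original answer.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame(
    [['001', '010', 'A', 'B'], ['001', '020', 'A', 'C'], ['010', '100', 'B', 'D'],
     ['100', '110', 'D', 'E'], ['002', '030', 'A', 'B'], ['002', '040', 'A', 'C']],
    columns=['parent_serialno', 'child_serialno', 'parent_function', 'child_function'])

# Build (function, serialno) node tuples without adding helper columns to df
parents = zip(df['parent_function'], df['parent_serialno'])
children = zip(df['child_function'], df['child_serialno'])

# Directed graph with one edge per parent -> child row
G = nx.DiGraph()
G.add_edges_from(zip(parents, children))

# Roots are the nodes with no incoming edge
roots = [node for node, degree in G.in_degree() if degree == 0]

rows = []
for root in roots:
    # Start the row with the root's own function -> serialno pair
    row = dict([root])
    # Every reachable node contributes one function -> serialno pair;
    # sorting by function keeps the column order stable
    row.update(sorted(nx.descendants(G, root)))
    rows.append(row)

# Functions missing for a root become NaN
out_alt = pd.DataFrame(rows)
```

On the sample data this should give the same frame as `out` above. If a root could reach two serial numbers with the same function, the later one would silently overwrite the earlier one, so this only fits data that meets the stated assumption.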
Corralien