3

I have a tree of categories represented by the following.

import pandas as pd

asset_tree = [
    {'id': 1, 'name': 'Linear Asset', 'parent_id': -1},
    {'id': 2, 'name': 'Lateral', 'parent_id': 1},
    {'id': 3, 'name': 'Main', 'parent_id': 1},
    {'id': 4, 'name': 'Point Asset', 'parent_id': -1},
    {'id': 5, 'name': 'Fountain', 'parent_id': 4},
    {'id': 6, 'name': 'Hydrant', 'parent_id': 4}
]
tree = pd.DataFrame(asset_tree)
print(tree)

This gives me a dataframe as follows:

   id          name  parent_id
0   1  Linear Asset         -1
1   2       Lateral          1
2   3          Main          1
3   4   Point Asset         -1
4   5      Fountain          4
5   6       Hydrant          4

The highest nodes in the tree have parent_id equal to -1, and so the tree can graphically be represented as follows:

Linear Asset
   | - Lateral
   | - Main
Point Asset
   | - Fountain
   | - Hydrant

I need to generate the following dataframe.

   id          name  parent_id  flat_name
0   1  Linear Asset         -1  Linear Asset
1   2       Lateral          1  Linear Asset : Lateral
2   3          Main          1  Linear Asset : Main
3   4   Point Asset         -1  Point Asset
4   5      Fountain          4  Point Asset : Fountain
5   6       Hydrant          4  Point Asset : Hydrant

The tree is generated dynamically and can have any number of levels and so the following tree

asset_tree = [
    {'id': 1, 'name': 'Linear Asset', 'parent_id': -1},
    {'id': 2, 'name': 'Lateral', 'parent_id': 1},
    {'id': 3, 'name': 'Main', 'parent_id': 1},
    {'id': 4, 'name': 'Point Asset', 'parent_id': -1},
    {'id': 5, 'name': 'Fountain', 'parent_id': 4},
    {'id': 6, 'name': 'Hydrant', 'parent_id': 4},
    {'id': 7, 'name': 'Steel', 'parent_id': 2},
    {'id': 8, 'name': 'Plastic', 'parent_id': 2},
    {'id': 9, 'name': 'Steel', 'parent_id': 3},
    {'id': 10, 'name': 'Plastic', 'parent_id': 3}
]

should result in the following:

   id          name  parent_id  flat_name
0   1  Linear Asset         -1  Linear Asset
1   2       Lateral          1  Linear Asset : Lateral
2   3          Main          1  Linear Asset : Main
3   4   Point Asset         -1  Point Asset
4   5      Fountain          4  Point Asset : Fountain
5   6       Hydrant          4  Point Asset : Hydrant
6   7         Steel          2  Linear Asset : Lateral : Steel
7   8       Plastic          2  Linear Asset : Lateral : Plastic
8   9         Steel          3  Linear Asset : Main : Steel
9  10       Plastic          3  Linear Asset : Main : Plastic
Sid Kwakkel
  • 749
  • 3
  • 11
  • 31

3 Answers3

5

Here is a recursive apply function for accomplishing this. The function takes in an id and returns its "path" through the tree:

def flatname(ID):
    row = df[df['id'] == ID].squeeze()
    if row['parent_id'] == -1:
        return row['name']
    else:
        return flatname(row['parent_id']) + ' : ' + row['name']

To use, call:

df['flat_name'] = df['id'].apply(flatname)

The df after used on your second example:

   id          name  parent_id                         flat_name
0   1  Linear Asset         -1                      Linear Asset
1   2       Lateral          1            Linear Asset : Lateral
2   3          Main          1               Linear Asset : Main
3   4   Point Asset         -1                       Point Asset
4   5      Fountain          4            Point Asset : Fountain
5   6       Hydrant          4             Point Asset : Hydrant
6   7         Steel          2    Linear Asset : Lateral : Steel
7   8       Plastic          2  Linear Asset : Lateral : Plastic
8   9         Steel          3       Linear Asset : Main : Steel
9  10       Plastic          3     Linear Asset : Main : Plastic

OP noted that the above function refers explicitly to the df variable defined outside of the function-scope. So if you call your DataFrame something different, or you want to call this on many DataFrames, this could cause problems. One fix would be to turn the apply function into more of a private helper, and create an external (more user-friendly) function which calls it:

def _flatname_recurse(ID, df):
    row = df[df['id'] == ID].squeeze()
    if row['parent_id'] == -1:
        return row['name']
    else:
        return _flatname_recurse(row['parent_id'], df=df) + ' : ' + row['name']

# asset_df to specify we are looking for a specific kind of df
def flatnames(asset_df):
    return asset_df['id'].apply(_flatname_recurse, df=asset_df)

Then call with:

df['flat_name'] = flatnames(df)

Also, note that I used to use row = df.iloc[ID - 1, :] for identifying the row, which worked in this case but was dependent on the id being one greater than the index. This approach is more general.

Tom
  • 8,310
  • 2
  • 16
  • 36
  • It is weird to reference the `df` dataframe inside the function that you are using in your apply function. It means that the function is tied to the dataframe that uses it. That said, the answer works for both cases. – Sid Kwakkel Mar 24 '21 at 16:00
  • 1
    @SidKwakkel I see what you mean (i.e. the reference to the `df` variable is specific to the preceding code). You could add a `df` parameter to `flatname`, and then call with `df['id'].apply(flatname, df=df)`. This also feels syntactically odd, but may be nice if you are applying to many DataFrames. – Tom Mar 24 '21 at 16:11
  • Or perhaps more user-friendly, create an external function (of a DataFrame) which calls this recursive function – Tom Mar 24 '21 at 16:12
  • I ended up using this solution because of increased readability. Al of the submissions to-date would have worked. – Sid Kwakkel Mar 24 '21 at 20:48
3

This is a network problem, try networkx:

import networkx as nx

# build the graph
G = nx.from_pandas_edgelist(tree, source='parent_id', target='id',
                            create_using=nx.DiGraph)

# map id to name
node_names = tree.set_index('id')['name'].to_dict()

# get path from root (-1) to the node
def get_path(node):
    # this is a tree, so exactly one simple path for each node
    for path in nx.simple_paths.all_simple_paths(G, -1, node):
        return ' : '.join(node_names.get(i) for i in path[1:])

tree['flat_name'] = tree['id'].apply(get_path)

Output:

   id          name  parent_id                         flat_name
0   1  Linear Asset         -1                      Linear Asset
1   2       Lateral          1            Linear Asset : Lateral
2   3          Main          1               Linear Asset : Main
3   4   Point Asset         -1                       Point Asset
4   5      Fountain          4            Point Asset : Fountain
5   6       Hydrant          4             Point Asset : Hydrant
6   7         Steel          2    Linear Asset : Lateral : Steel
7   8       Plastic          2  Linear Asset : Lateral : Plastic
8   9         Steel          3       Linear Asset : Main : Steel
9  10       Plastic          3     Linear Asset : Main : Plastic
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • I can confirm that this answer works, but I have favored another answer because I want to reduce the dependency on libraries. – Sid Kwakkel Mar 24 '21 at 15:46
2

You can use recursion to find the path to the parent id:

import pandas as pd
asset_tree = [{'id': 1, 'name': 'Linear Asset', 'parent_id': -1}, {'id': 2, 'name': 'Lateral', 'parent_id': 1}, {'id': 3, 'name': 'Main', 'parent_id': 1}, {'id': 4, 'name': 'Point Asset', 'parent_id': -1}, {'id': 5, 'name': 'Fountain', 'parent_id': 4}, {'id': 6, 'name': 'Hydrant', 'parent_id': 4}]
a_tree = {i['id']:i for i in asset_tree} #to dictionary for more efficient lookup
def get_parent(d, c = []):
   if (k:=a_tree.get(d['parent_id'])) is None:
      return c + [d['name']]
   return get_parent(k, c+[d['name']])

r = [{**i, 'flat_name':' : '.join(get_parent(i)[::-1])} for i in asset_tree]
df = pd.DataFrame(r)

Output:

    id         name  parent_id               flat_name
0   1  Linear Asset         -1            Linear Asset
1   2       Lateral          1  Linear Asset : Lateral
2   3          Main          1     Linear Asset : Main
3   4   Point Asset         -1             Point Asset
4   5      Fountain          4  Point Asset : Fountain
5   6       Hydrant          4   Point Asset : Hydrant

On your larger asset_tree:

asset_tree = [{'id': 1, 'name': 'Linear Asset', 'parent_id': -1}, {'id': 2, 'name': 'Lateral', 'parent_id': 1}, {'id': 3, 'name': 'Main', 'parent_id': 1}, {'id': 4, 'name': 'Point Asset', 'parent_id': -1}, {'id': 5, 'name': 'Fountain', 'parent_id': 4}, {'id': 6, 'name': 'Hydrant', 'parent_id': 4}, {'id': 7, 'name': 'Steel', 'parent_id': 2}, {'id': 8, 'name': 'Plastic', 'parent_id': 2}, {'id': 9, 'name': 'Steel', 'parent_id': 3}, {'id': 10, 'name': 'Plastic', 'parent_id': 3}]
a_tree = {i['id']:i for i in asset_tree}
r = [{**i, 'flat_name':' : '.join(get_parent(i)[::-1])} for i in asset_tree]
df = pd.DataFrame(r)

Output:

   id          name  parent_id                         flat_name
0   1  Linear Asset         -1                      Linear Asset
1   2       Lateral          1            Linear Asset : Lateral
2   3          Main          1               Linear Asset : Main
3   4   Point Asset         -1                       Point Asset
4   5      Fountain          4            Point Asset : Fountain
5   6       Hydrant          4             Point Asset : Hydrant
6   7         Steel          2    Linear Asset : Lateral : Steel
7   8       Plastic          2  Linear Asset : Lateral : Plastic
8   9         Steel          3       Linear Asset : Main : Steel
9  10       Plastic          3     Linear Asset : Main : Plastic
Ajax1234
  • 69,937
  • 8
  • 61
  • 102
  • `(k:=a_tree.get(d['parent_id']))` is some strange Python syntax I never seen before. Is it a special package? – Quang Hoang Mar 24 '21 at 15:39
  • @QuangHoang That is an [assignment expression](https://www.python.org/dev/peps/pep-0572/), available in Python 3.8 and higher. – Ajax1234 Mar 24 '21 at 15:40
  • @Ajax1234 ah thanks. I’m still sticking with 3.7. – Quang Hoang Mar 24 '21 at 15:41
  • I can confirm that this answer works, but I have favored another answer because I believe the readability of the other is better. – Sid Kwakkel Mar 24 '21 at 15:44
  • 2
    @SidKwakkel I'd recommend this one over Tom's, because it does not need the parent row to appear & processed before the children. Just think the case a parent is actually next row of a child. The efficient is actually not optimal in Ajax's code and I believe it is O(N^2) by the way the name array is constructed. This probably can be done in O(N) if you work under a graph theory mindset. – Bing Wang Mar 24 '21 at 16:30