How to look up all ancestors in pandas dataframe and calculate the average of them (including the original user)

Question

I have run into an issue. I created a game where I have millions of users. Each user is be default their own "team leader" but are allowed to change their leader to any other user. It's possible for multiple users to create a circular dependency here. So I have a large dataframe that looks like (here is an example):

user_id | leader_id | power
---|---|----
1 | 1 | 10
2 | 3 | 20
3 | 2 | 30
4 | 5 | 40
5 | 5 | 50

I am not interested in the full teams, only the people below you in rank on your team. So the "teams" I'm interested in should look like (note, user 4 is on their own team since they lead no one but user 5 has a team of [4, 5] since they lead user 4).

user_id | leader_id | power | team
---|---|----|----
1 | 1 | 10 | [1]
2 | 3 | 20 | [2, 3]
3 | 2 | 30 | [2, 3]
4 | 5 | 40 | [4]
5 | 5 | 50 | [4, 5]

Now, all I want to do is create a column which is the average power level of your team. It sounds simple but I can't get it to run at all fast on my millions of rows of data. I tried using networkx but it basically steps through each person and re-calculates the entire tree each time. I tried writing my own function that caches it but I'm still struggling. I wrote an example script with tests. I would be very grateful if anyone could help me. I am open to any other methods of calculating this. When I run this script the networkx approach takes about ~53 seconds while the other approach takes about ~38 seconds. I am hoping to get it much faster since this is only on 50,000 rows of sample data.

import time

import networkx as nx
import numpy as np
import pandas as pd

from tests.pandas_test_utils import assert_frame_equal


def get_team_power_using_networkx(df):
    G = nx.from_pandas_edgelist(df, source='user_id', target='leader_id', create_using=nx.DiGraph())
    ancestors = {node: nx.ancestors(G, node) for node in G.nodes()}

    user_ids = df['user_id'].values

    team_ids_list = [list(ancestors.get(user_id, set()) | {user_id}) for user_id in user_ids]

    hierarchal_scores = [df.loc[df['user_id'].isin(team_ids), "power"].dropna().values for team_ids in team_ids_list]
    avg_scores = [np.sum(scores) / len(scores) if len(scores) > 0 else np.nan for scores in hierarchal_scores]
    df["team_power"] = np.where(np.isnan(avg_scores), np.nan, np.round(avg_scores).astype(int))

    return df


def get_team_power_using_custom(df):
    G = nx.from_pandas_edgelist(df, source='user_id', target='leader_id', create_using=nx.DiGraph())

    def get_ancestors(node):
        if node in ancestors:
            return ancestors[node]
        ancestors[node] = set()
        for parent in G.predecessors(node):
            ancestors[node].add(parent)
            ancestors[node].update(get_ancestors(parent))
        return ancestors[node]

    ancestors = {}
    user_ids = df['user_id'].values

    team_ids_list = [list(get_ancestors(user_id) | {user_id}) for user_id in user_ids]

    hierarchal_scores = [df.loc[df['user_id'].isin(team_ids), "power"].dropna().values for team_ids in team_ids_list]
    avg_scores = [np.sum(scores) / len(scores) if len(scores) > 0 else np.nan for scores in hierarchal_scores]
    df["team_power"] = np.where(np.isnan(avg_scores), np.nan, np.round(avg_scores).astype(int))

    return df


## Test if it works
"""
user 1 is alone. team is [1] and average power is [10]
user 2 and 3 have each other as leaders. both have teams consisting of [2, 3] and average power is [25]
user 4 reports to user 5. team is [4] and average power is [40]
user 5 reports to themself. team is [4, 5, 6, 7] and average power is [53] -> (40 + 50 + 70) / 3
user 6 reports to user 5. team is [6, 7] and average power is [70]
user 7 reports to user 6 who reports to user 5. team is [7] and average power is [70]
user 8 reports to themself. team is [8] and average power is [np.nan]
"""
user_id = [1, 2, 3, 4, 5, 6, 7, 8]
leader_id = [1, 3, 2, 5, 5, 5, 6, 8]
power = [10, 20, 30, 40, 50, np.nan, 70, np.nan]
# Create the DataFrame
data = {'user_id': user_id, 'leader_id': leader_id, 'power': power, }
df = pd.DataFrame(data)
for func in [get_team_power_using_networkx, get_team_power_using_custom]:
    print(f"Testing: {func.__name__}")
    results = func(df)[["user_id", "team_power"]]
    expected_results = pd.DataFrame({"user_id": [1, 2, 3, 4, 5, 6, 7, 8], "team_power": [10, 25, 25, 40, 53, 70, 70, np.nan]})
    assert_frame_equal(results, expected_results)

## Test speed
num_rows = 50000
np.random.seed(42)
# Generate random values for each column
user_id = np.random.randint(1, 1000, size=num_rows)
leader_id = np.random.randint(1, 1000, size=num_rows)
power = np.random.uniform(0, 100, size=num_rows)
# Create the DataFrame
data = {'user_id': user_id, 'leader_id': leader_id, 'power': power, }
df = pd.DataFrame(data)
for func in [get_team_power_using_networkx, get_team_power_using_custom]:
    start_time = time.time()  # Record the start time
    func(df)
    end_time = time.time()  # Record the end time
    print(f"Time to run {func.__name__}:", end_time - start_time)

Andrej Kesely · Accepted Answer · 2023-07-19T20:11:17.860

IIUC, you can do:

groups = df.groupby('leader_id')['user_id'].agg(set).to_dict()

df['team'] = df['user_id'].apply(lambda x: {x} | groups.get(x, set()))
df['team_power'] = df['team'].apply(lambda x: df.loc[df['user_id'].isin(x), 'power'].mean())
print(df)

Prints:

   user_id  leader_id  power       team  team_power
0        1          1   10.0        {1}        10.0
1        2          3   20.0     {2, 3}        25.0
2        3          2   30.0     {2, 3}        25.0
3        4          5   40.0        {4}        40.0
4        5          5   50.0  {4, 5, 6}        45.0
5        6          5    NaN        {6}         NaN

Initial df:

   user_id  leader_id  power
0        1          1   10.0
1        2          3   20.0
2        3          2   30.0
3        4          5   40.0
4        5          5   50.0
5        6          5    NaN

UPDATE: To calculate the team recursively:

def get_team(team):
    for user in df.loc[df['leader_id'].isin(team), 'user_id']:
        if user not in team:
            team |= get_team(team | {user})
    return team

df['team'] = [get_team({u}) for u in df['user_id']]
df['team_power'] = df['team'].apply(lambda x: df.loc[df['user_id'].isin(x), 'power'].mean())
print(df)

Prints:

   user_id  leader_id  power          team  team_power
0        1          1   10.0           {1}   10.000000
1        2          3   20.0        {2, 3}   25.000000
2        3          2   30.0        {2, 3}   25.000000
3        4          5   40.0           {4}   40.000000
4        5          5   50.0  {4, 5, 6, 7}   53.333333
5        6          5    NaN        {6, 7}   70.000000
6        7          6   70.0           {7}   70.000000
7        8          8    NaN           {8}         NaN

@elthran Can you please update your question and put there example with the edge cases + expected output? — Andrej Kesely, Jul 19 '23 at 19:47
This passes the tests but fails on random larger data sets due to infinite recursion due to circular loops over leadership — elthran, Jul 19 '23 at 21:05
@elthran Can you please update your question and put there example where the infinite recursion happens? — Andrej Kesely, Jul 19 '23 at 21:15
If I add `np.random.seed(42)` the second test with 50,000 rows will immediately get `RecursionError: maximum recursion depth exceeded` but otherwise it works — elthran, Jul 20 '23 at 01:11
@elthran You can try to increase the recursion limit https://stackoverflow.com/questions/3323001/what-is-the-maximum-recursion-depth-and-how-to-increase-it — Andrej Kesely, Jul 20 '23 at 01:15

How to look up all ancestors in pandas dataframe and calculate the average of them (including the original user)

1 Answers1