I have run into an issue. I created a game where I have millions of users. Each user is be default their own "team leader" but are allowed to change their leader to any other user. It's possible for multiple users to create a circular dependency here. So I have a large dataframe that looks like (here is an example):
user_id | leader_id | power
---|---|----
1 | 1 | 10
2 | 3 | 20
3 | 2 | 30
4 | 5 | 40
5 | 5 | 50
I am not interested in the full teams, only the people below you in rank on your team. So the "teams" I'm interested in should look like (note, user 4 is on their own team since they lead no one but user 5 has a team of [4, 5] since they lead user 4).
user_id | leader_id | power | team
---|---|----|----
1 | 1 | 10 | [1]
2 | 3 | 20 | [2, 3]
3 | 2 | 30 | [2, 3]
4 | 5 | 40 | [4]
5 | 5 | 50 | [4, 5]
Now, all I want to do is create a column which is the average power level of your team. It sounds simple but I can't get it to run at all fast on my millions of rows of data. I tried using networkx but it basically steps through each person and re-calculates the entire tree each time. I tried writing my own function that caches it but I'm still struggling. I wrote an example script with tests. I would be very grateful if anyone could help me. I am open to any other methods of calculating this. When I run this script the networkx approach takes about ~53 seconds while the other approach takes about ~38 seconds. I am hoping to get it much faster since this is only on 50,000 rows of sample data.
import time
import networkx as nx
import numpy as np
import pandas as pd
from tests.pandas_test_utils import assert_frame_equal
def get_team_power_using_networkx(df):
G = nx.from_pandas_edgelist(df, source='user_id', target='leader_id', create_using=nx.DiGraph())
ancestors = {node: nx.ancestors(G, node) for node in G.nodes()}
user_ids = df['user_id'].values
team_ids_list = [list(ancestors.get(user_id, set()) | {user_id}) for user_id in user_ids]
hierarchal_scores = [df.loc[df['user_id'].isin(team_ids), "power"].dropna().values for team_ids in team_ids_list]
avg_scores = [np.sum(scores) / len(scores) if len(scores) > 0 else np.nan for scores in hierarchal_scores]
df["team_power"] = np.where(np.isnan(avg_scores), np.nan, np.round(avg_scores).astype(int))
return df
def get_team_power_using_custom(df):
G = nx.from_pandas_edgelist(df, source='user_id', target='leader_id', create_using=nx.DiGraph())
def get_ancestors(node):
if node in ancestors:
return ancestors[node]
ancestors[node] = set()
for parent in G.predecessors(node):
ancestors[node].add(parent)
ancestors[node].update(get_ancestors(parent))
return ancestors[node]
ancestors = {}
user_ids = df['user_id'].values
team_ids_list = [list(get_ancestors(user_id) | {user_id}) for user_id in user_ids]
hierarchal_scores = [df.loc[df['user_id'].isin(team_ids), "power"].dropna().values for team_ids in team_ids_list]
avg_scores = [np.sum(scores) / len(scores) if len(scores) > 0 else np.nan for scores in hierarchal_scores]
df["team_power"] = np.where(np.isnan(avg_scores), np.nan, np.round(avg_scores).astype(int))
return df
## Test if it works
"""
user 1 is alone. team is [1] and average power is [10]
user 2 and 3 have each other as leaders. both have teams consisting of [2, 3] and average power is [25]
user 4 reports to user 5. team is [4] and average power is [40]
user 5 reports to themself. team is [4, 5, 6, 7] and average power is [53] -> (40 + 50 + 70) / 3
user 6 reports to user 5. team is [6, 7] and average power is [70]
user 7 reports to user 6 who reports to user 5. team is [7] and average power is [70]
user 8 reports to themself. team is [8] and average power is [np.nan]
"""
user_id = [1, 2, 3, 4, 5, 6, 7, 8]
leader_id = [1, 3, 2, 5, 5, 5, 6, 8]
power = [10, 20, 30, 40, 50, np.nan, 70, np.nan]
# Create the DataFrame
data = {'user_id': user_id, 'leader_id': leader_id, 'power': power, }
df = pd.DataFrame(data)
for func in [get_team_power_using_networkx, get_team_power_using_custom]:
print(f"Testing: {func.__name__}")
results = func(df)[["user_id", "team_power"]]
expected_results = pd.DataFrame({"user_id": [1, 2, 3, 4, 5, 6, 7, 8], "team_power": [10, 25, 25, 40, 53, 70, 70, np.nan]})
assert_frame_equal(results, expected_results)
## Test speed
num_rows = 50000
np.random.seed(42)
# Generate random values for each column
user_id = np.random.randint(1, 1000, size=num_rows)
leader_id = np.random.randint(1, 1000, size=num_rows)
power = np.random.uniform(0, 100, size=num_rows)
# Create the DataFrame
data = {'user_id': user_id, 'leader_id': leader_id, 'power': power, }
df = pd.DataFrame(data)
for func in [get_team_power_using_networkx, get_team_power_using_custom]:
start_time = time.time() # Record the start time
func(df)
end_time = time.time() # Record the end time
print(f"Time to run {func.__name__}:", end_time - start_time)