Given a dataframe with one column of players and another column with a subset of teammates, form the entire teams

Question

Suppose I have a dataframe like this:

    player  teammates
0   A       [C,F]
1   C       [A,F]
2   B       [B]
3   D       [H,J,K]
4   H       [J,K]
5   Q       [D]

Rows 3, 4 and 5 are the challenging ones: each lists only part of its team, so D, H, J, K and Q belong together even though no single row says so. If the teammates column contained the entire team for each player, the problem would be trivial.

The expected output would be a list of all teams, so like:

[[A,C,F], [B], [D,H,J,K,Q]]

A first step could be to merge both columns into one with

df.apply(lambda row: list(set([row['player']] + row['teammates'])), axis=1)

which gives:

0  [A,C,F]
1  [A,C,F]
2  [B]
3  [D,H,J,K]
4  [H,J,K]
5  [Q,D]

but checking pairwise for common elements and further consolidating seems very inefficient. Is there an efficient way to get the desired output?

Have you tested to see just how "inefficient" your pairwise solution is? — Scott Hunter, Jan 21 '22 at 13:17
@ScottHunter: Not yet, but intuitively that seems to be the case since there are 100's of thousands of rows :( — user9343456, Jan 21 '22 at 13:18
"Most efficient" is not necessarily the same as "efficient enough". — Scott Hunter, Jan 21 '22 at 13:20

jezrael · Accepted Answer · 2022-01-21T13:28:29.123


Treat the data as a graph. Use DataFrame.explode on the teammates column so that each (player, teammate) pair becomes an edge, then take the connected components with networkx:

import networkx as nx

# Build an undirected graph with one edge per (player, teammate) pair
g = nx.Graph()

g.add_edges_from(df[['player','teammates']].explode('teammates').itertuples(index=False))

# Each connected component is one complete team
new = list(nx.connected_components(g))
print(new)
# [{'F', 'A', 'C'}, {'B'}, {'Q', 'K', 'H', 'J', 'D'}]

If you need lists instead of sets:

L = [list(x) for x in new]
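
For reference, here is a self-contained sketch of the approach above, run on the sample data from the question. The `add_nodes_from` and `dropna` lines are extra guards that are not part of the answer: they assume some rows might have an empty teammates list, which `explode` turns into NaN. The sorting is only there to make the printed result deterministic, since sets have no fixed order.

```python
import networkx as nx
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "player": ["A", "C", "B", "D", "H", "Q"],
    "teammates": [["C", "F"], ["A", "F"], ["B"], ["H", "J", "K"], ["J", "K"], ["D"]],
})

# One row per (player, teammate) pair
edges = df[["player", "teammates"]].explode("teammates")

g = nx.Graph()
# Assumption: register every player as a node so that a player with an
# empty teammates list still ends up in a team of their own
g.add_nodes_from(df["player"])
# Assumption: drop the NaN rows that explode() produces for empty lists
g.add_edges_from(edges.dropna().itertuples(index=False, name=None))

# Each connected component is one team; sort for a stable, readable result
teams = sorted(sorted(component) for component in nx.connected_components(g))
print(teams)
# [['A', 'C', 'F'], ['B'], ['D', 'H', 'J', 'K', 'Q']]
```

Building the graph touches each (player, teammate) pair once, and finding the components is linear in the number of nodes and edges, so this avoids the pairwise comparison of rows.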

edited Jan 21 '22 at 13:28

answered Jan 21 '22 at 13:18


One follow-up question: for each set in `new`, is there a way to pick one element of the set that is in the original `player` column? For example, for the `print(new)` output you gave above, a list picking one set element that is in the `player` column would look like `['A', 'B', 'Q']`. The naive way could be `[list(set(df['player']).intersection(x))[0] for x in new]`. Is this efficient, or is there a faster method in the networkx library? – user9343456 Feb 24 '22 at 08:42
@user9343456 - pandas method should be `df[~df['teammates'].str[0].map({x:i for i, v in enumerate(new) for x in v}).duplicated()]` – jezrael Feb 24 '22 at 09:12
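
A plain-Python sketch of the same idea, in case it is easier to follow than the one-liner: build a lookup from each member to the index of its team once, then walk the `player` column a single time and keep the first player seen for each team. This is not a networkx feature, just an illustration. Here `players` is a hypothetical stand-in for `df['player'].tolist()`, and `new` is the list of sets printed in the answer.

```python
# Stand-ins for df['player'].tolist() and the connected components
players = ["A", "C", "B", "D", "H", "Q"]
new = [{"F", "A", "C"}, {"B"}, {"Q", "K", "H", "J", "D"}]

# Map every member to the index of the team it belongs to
team_of = {member: i for i, team in enumerate(new) for member in team}

# Keep the first player (in column order) encountered for each team
representative = {}
for p in players:
    representative.setdefault(team_of[p], p)

reps = [representative[i] for i in range(len(new))]
print(reps)
# ['A', 'B', 'D']
```

This picks `D` rather than `Q` for the last team because `D` appears first in the `player` column; any player from the team is an equally valid choice. Each member and each player is visited once, whereas the set-intersection version rebuilds `set(df['player'])` for every team.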
