Suppose I have a dataframe like this:

    player  teammates
0   A       [C,F]
1   C       [A,F]
2   B       [B]
3   D       [H,J,K]
4   H       [J,K]
5   Q       [D]

Rows 3, 4 and 5 are the challenging ones: none of them lists the whole team, so the team has to be pieced together from overlapping rows. If the teammates column contained the entire team for each player, the problem would be trivial.

The expected output is a list of all teams, like:

[[A,C,F], [B], [D,H,J,K,Q]]

A first step could be to merge both columns into one:

df.apply(lambda row: list(set([row['player']]+row['teammates'])), axis=1)

which gives

0  [A,C,F]
1  [A,C,F]
2  [B]
3  [D,H,J,K]
4  [H,J,K]
5  [Q,D]

but checking pairwise for common elements and further consolidating seems very inefficient. Is there an efficient way to get the desired output?

user9343456

1 Answer

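Treat every (player, teammate) pair as an edge of a graph: use DataFrame.explode to expand the teammates column into one row per pair, then take the connected components with networkx. Each component is one team: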

import networkx as nx

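# Build an undirected graph with one edge per (player, teammate) pair;
# explode() turns each list in 'teammates' into one row per teammate
g = nx.Graph()
g.add_edges_from(df[['player','teammates']].explode('teammates').itertuples(index=False))

# Each connected component is a set of players that belong to the same team
new = list(nx.connected_components(g))
print(new)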
[{'F', 'A', 'C'}, {'B'}, {'Q', 'K', 'H', 'J', 'D'}]

If you need lists instead of sets:

L = [list(x) for x in new]
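
For readers who want to see what the connected-components step does, or who cannot install networkx, here is a minimal dependency-free sketch using a union-find (disjoint-set) structure. This is a swapped-in technique, not what networkx does internally, and the helper name `group_teams` is hypothetical. It takes the two columns as plain sequences; with the dataframe from the question, `df['player']` and `df['teammates']` could presumably be passed directly.

```python
def group_teams(players, teammates):
    """Group players into teams with a small union-find (hypothetical helper)."""
    parent = {}

    def find(x):
        # Register unseen names as their own root, then walk up to the root
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for player, mates in zip(players, teammates):
        find(player)  # make sure players without teammates are still recorded
        for mate in mates:
            union(player, mate)

    # Collect the members under each root; sort only to get a stable output
    teams = {}
    for member in list(parent):
        teams.setdefault(find(member), []).append(member)
    return sorted(sorted(team) for team in teams.values())


# Same data as the dataframe in the question, written as plain lists
players = ["A", "C", "B", "D", "H", "Q"]
teammates = [["C", "F"], ["A", "F"], ["B"], ["H", "J", "K"], ["J", "K"], ["D"]]

teams = group_teams(players, teammates)
print(teams)  # [['A', 'C', 'F'], ['B'], ['D', 'H', 'J', 'K', 'Q']]
```

Each pair is processed once, so this avoids the pairwise comparison of merged lists that the question was worried about.
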
jezrael
  • One follow-up question - for each set in `new`, is there a way to recognize one element in the set that is in the original `player` column? e.g. in the output for `print (new)` as you gave above, the output list picking one set element that is in the `player` column would look like `['A', 'B', 'Q']`. The naive way could be like `[list(set(df['player']).intersection(x))[0] for x in new]`. Is this an efficient way or is there a better method in the networkx library that would be faster? – user9343456 Feb 24 '22 at 08:42
  • @user9343456 - pandas method should be `df[~df['teammates'].str[0].map({x:i for i, v in enumerate(new) for x in v}).duplicated()]` – jezrael Feb 24 '22 at 09:12
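
The follow-up in the comments (one representative per team, taken from the `player` column) can also be sketched in plain Python. This is only an illustration under assumptions: the helper name `pick_representatives` is hypothetical, and it picks the first player in column order for each component, so the third team yields `'D'` rather than `'Q'`. With the dataframe from the question, `df['player']` could presumably be passed as `players`.

```python
def pick_representatives(components, players):
    """Return, per component, the first entry of `players` that belongs to it.

    Hypothetical helper; components without any listed player yield None.
    """
    # Map every member to the index of the component that contains it
    component_of = {
        member: i for i, comp in enumerate(components) for member in comp
    }

    # One pass over the player column; keep only the first hit per component
    reps = {}
    for player in players:
        idx = component_of.get(player)
        if idx is not None and idx not in reps:
            reps[idx] = player

    return [reps.get(i) for i in range(len(components))]


# Components as printed in the answer, plus the 'player' column as a list
new = [{"F", "A", "C"}, {"B"}, {"Q", "K", "H", "J", "D"}]
players = ["A", "C", "B", "D", "H", "Q"]

representatives = pick_representatives(new, players)
print(representatives)  # ['A', 'B', 'D']
```

Unlike the naive set-intersection approach, this builds the lookup once and scans the player column a single time, and the result does not depend on set ordering.
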