First, I grouped the dataframe on the club to get a set of each person in the club.
grouped = df.groupby("Club").agg({"Person": set}).reset_index()
Club Person
0 A {1, 2, 3}
1 B {1, 3}
2 C {1, 2, 4}
Then, I created a Cartesian product of this dataframe. I didn't have pandas 1.2.0, so I couldn't use the cross join available in df.merge()
. Instead, I used the idea from this answer: pandas two dataframe cross join
grouped["key"] = 0
product = grouped.merge(grouped, on="key", how="outer").drop(columns="key")
Club_x Person_x Club_y Person_y
0 A {1, 2, 3} A {1, 2, 3}
1 A {1, 2, 3} B {1, 3}
2 A {1, 2, 3} C {1, 2, 4}
3 B {1, 3} A {1, 2, 3}
4 B {1, 3} B {1, 3}
5 B {1, 3} C {1, 2, 4}
6 C {1, 2, 4} A {1, 2, 3}
7 C {1, 2, 4} B {1, 3}
8 C {1, 2, 4} C {1, 2, 4}
I then filtered out pairs where Club_x < Club_y
so it removes duplicate pairs.
filtered = product[product["Club_x"] < product["Club_y"]]
Club_x Person_x Club_y Person_y
1 A {1, 2, 3} B {1, 3}
2 A {1, 2, 3} C {1, 2, 4}
5 B {1, 3} C {1, 2, 4}
Finally, I added the column with the overlap size and renamed columns as necessary.
result = filtered.assign(Num_Overlaps=filtered.apply(lambda row: len(row["Person_x"].intersection(row["Person_y"])), axis=1))
result = result.rename(columns={"Club_x": "Club 1", "Club_y": "Club 2"}).drop(["Person_x", "Person_y"], axis=1)
Club 1 Club 2 Num_Overlaps
1 A B 2
2 A C 2
5 B C 1